Asymptotic distribution and sparsistency
for $\ell_1$-penalized parametric M-estimators,
with applications to linear SVM and logistic regression

Guilherme Rocha, Xing Wang and Bin Yu
August 13, 2009 



Abstract 

Since its early use in least squares regression problems, the $\ell_1$-penalization framework for variable
selection has been employed in conjunction with a wide range of loss functions encompassing regression,
classification and survival analysis. While a well developed theory exists for the $\ell_1$-penalized least
squares estimates, few results concern the behavior of $\ell_1$-penalized estimates for general loss functions.
In this paper, we derive two results concerning penalized estimates for a wide array of penalty and
loss functions. Our first result characterizes the asymptotic distribution of penalized parametric
M-estimators under mild conditions on the loss and penalty functions in the classical setting
(fixed-p-large-n). Our second result makes explicit necessary and sufficient generalized irrepresentability (GI)
conditions for $\ell_1$-penalized parametric M-estimates to consistently select the components of a model
(sparsistency) as well as their sign (sign consistency). In general, the GI conditions depend on the Hessian
of the risk function at the true value of the unknown parameter. Under Gaussian predictors, we obtain a set of
conditions under which the GI conditions can be re-expressed solely in terms of the second moment of
the predictors. We apply our theory to contrast $\ell_1$-penalized SVM and logistic regression classifiers and
find conditions under which they have the same behavior in terms of their model selection consistency
(sparsistency and sign consistency). Finally, we provide simulation evidence for the theory based on
these classification examples.

1 Introduction



When modeling a response variable $Y \in \mathcal{Y}$ as a function of a set of predictors $X \in \mathbb{R}^p$, statisticians
often rely on M-estimators for linear models defined as

$$(\hat\alpha_n, \hat\beta_n) := \arg\min_{a \in \mathbb{R},\, b \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} L(Y_i, a + X_i^T b), \qquad (1)$$

where $Z_i = (Y_i, X_i)$, $i = 1, \ldots, n$, are independent observations of $Z = (Y, X)$ and the loss function
$L : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}_+$ measures the lack of quality of $a + X_i^T b$ in representing $Y_i$. For a given problem,
many alternative loss functions can be used. Some recent results are aimed at comparing the properties of
estimates obtained from alternative loss functions (Zhang, 2004; Bartlett et al., 2006).



The choice of an appropriate loss function must take the goal of the analysis into account. Often, the
estimates in (1) are used as a tool in understanding the effects of $X$ on $Y$. In that case, sparse estimates $\hat\beta_n$
are desirable as they select which predictors in $X$ have an effect on the response $Y$. Sparse estimates are
often achieved by a penalized estimate

$$(\hat\alpha_n(\lambda_n), \hat\beta_n(\lambda_n)) := \arg\min_{a \in \mathbb{R},\, b \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} L(Y_i, a + X_i^T b) + \lambda_n \cdot T(b), \qquad (2)$$



where $\lambda_n \geq 0$ is a regularization parameter and $T : \mathbb{R}^p \to \mathbb{R}$ is a function penalizing non-sparse models.
Many alternative sparsity-inducing penalties exist, and a popular family of such penalties is the set of $\ell_\gamma$
norms with $\gamma \in (0, 1]$ used in bridge estimates (Frank and Friedman, 1993). The $\ell_\gamma$ norm function is given
by $\|b\|_\gamma := \left( \sum_{j=1}^{p} |b_j|^\gamma \right)^{1/\gamma}$. Two important particular cases are the $\ell_0$-penalty, defined as a penalty
on the number of non-zero terms in the estimate and used in (Akaike, 1973, 1974; Schwarz, 1978; Rissanen, 1978;
Hansen and Yu, 2001), and the $\ell_1$-penalty used in the LASSO (Tibshirani, 1996) and basis pursuit
(Chen et al., 2001).
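
As a concrete illustration of the penalized criterion in (2), the following minimal Python sketch evaluates the objective for a user-supplied loss and penalty. The function names (`l_gamma_penalty`, `l0_penalty`, `l1_penalty`, `penalized_risk`, `squared_error`) and the toy data are hypothetical choices made for this illustration only, and the $1/n$ normalization of the loss term is assumed.

```python
def l_gamma_penalty(b, gamma):
    """Bridge penalty ||b||_gamma = (sum_j |b_j|^gamma)^(1/gamma), for gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return sum(abs(bj) ** gamma for bj in b) ** (1.0 / gamma)


def l1_penalty(b):
    """The l1 penalty used by the LASSO: sum_j |b_j|."""
    return l_gamma_penalty(b, 1.0)


def l0_penalty(b):
    """The l0 penalty: number of non-zero coefficients."""
    return sum(1 for bj in b if bj != 0)


def squared_error(y, m):
    """Squared error loss at linear predictor m (illustrative choice of L)."""
    return (y - m) ** 2


def penalized_risk(loss, data, a, b, lam, penalty):
    """Average loss over the sample plus lam * penalty(b).

    `data` is a list of (y_i, x_i) pairs; the intercept a is not penalized.
    The 1/n normalization is an assumption of this sketch.
    """
    n = len(data)
    fit = sum(
        loss(y, a + sum(xj * bj for xj, bj in zip(x, b))) for y, x in data
    ) / n
    return fit + lam * penalty(b)


# Toy data that b = (1, 2) with a = 0 fits exactly.
data = [(1.0, (1.0, 0.0)), (2.0, (0.0, 1.0)), (3.0, (1.0, 1.0))]
value = penalized_risk(squared_error, data, 0.0, (1.0, 2.0), 0.5, l1_penalty)
```

Because the toy data are fitted exactly, `value` consists of the penalty term alone, which makes the role of $\lambda_n$ easy to see.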



Recently, a large number of $\ell_1$-penalized estimates based on different loss functions have been proposed
in the literature. Some examples are the logistic regression and Cox's proportional hazards model loss
(Tibshirani, 1997; Park and Hastie, 2006), the hinge loss function for classification (Zhu et al., 2004), the
quantile regression loss (Li and Zhu, 2008), and the log-determinant Bregman divergence of covariance
matrices (Banerjee et al., 2005; Ravikumar et al., 2008). Simultaneously, many families of sparsity-inducing
penalties have been introduced, such as the SCAD penalty (Fan and Li, 2001) and the generalized elastic net
(Friedman, 2008). In this paper, we present theoretical results allowing the behavior of estimates based on
different loss and penalty functions to be compared.

Our first main result is a characterization of the asymptotic distribution of the penalized estimates in (2)
for a wide class of penalty and convex loss functions. Our result extends previous results for the squared
error loss and $\ell_\gamma$ norms by Knight and Fu (2000) and applies to the classical asymptotic setup (large n, fixed
p). We state our results in a modular fashion so they encompass several combinations of loss and penalty
functions. We provide sufficient conditions on the loss and on the penalty functions for our results to apply.
On the loss side, our results require the loss function to be convex and the risk function, defined as

$$R(t) := E_{X,Y}\left[ L(Y, a + b^T X) \right], \quad \text{for } t = (a, b) \in \mathbb{R}^{1+p}, \qquad (3)$$

to be twice continuously differentiable at the "true" value of the parameters $(\alpha, \beta)$:

$$(\alpha, \beta) := \arg\min_{(a, b)} R(a, b). \qquad (4)$$



Our second result obtains necessary and sufficient conditions for the $\ell_1$-penalized estimate in (2) to
consistently select the zeroes (sparsistency) and signs (sign consistency) in the parameter $\beta$. Previous results
for $\ell_1$-penalized least squares linear regression show that the set of active and inactive predictors must be
sufficiently disentangled for sparsistency to hold. This requirement is embodied in "incoherence" or
"irrepresentability" conditions (Meinshausen and Bühlmann, 2004; Zhao and Yu, 2006; Zou, 2006; Wainwright,
2006). We call the condition for sparsistency and sign consistency of general $\ell_1$-penalized M-estimators
the generalized irrepresentability (GI) condition. Intuitively, the GI condition can be interpreted as a
requirement that the effects of active and inactive predictors on the loss are distinguishable enough (after
"controlling" for the intercept term). This second result relies on the quadratic approximation developed in
the first result and is thus only applicable in the classical small-p-large-n case.

Our third result shows that, if the predictors are zero-mean Gaussian and the response variable only
depends on $X$ through an affine function of the predictors, the conditions for $\ell_1$-penalized estimates as in (2)
do not depend on the loss function. In that case, the GI condition reduces to the "irrepresentable" condition
in Zhao and Yu (2006). This surprising result stems from the properties of the multivariate Gaussian
distribution, namely its linear mean and constant variance when conditioned on one of its linear combinations.

We apply the theory to contrast and compare linear classifiers based on the hinge loss (parametric SVM)
and logistic regression. We obtain expressions for the Hessians of the SVM and logistic regression risks and
characterize them as weighted averages of the second moment matrices of the predictors conditional on a
properly defined "linear predictor" variable $M = \alpha + \beta^T X$. Based on this characterization and using our
third result, we show that, for a given joint distribution of $(Y, X)$ where the predictors are Gaussian and the
response variable only depends on the predictors through an affine transformation, the two classifiers are
either both sparsistent or both not sparsistent. For more general joint distributions, one of the classifiers
can be sparsistent while the other is not. Over a set of cases where the predictors are mixed Gaussian, we
observed logistic regression to be sparsistent more often than SVM classifiers but also observed mixed results
in finite samples. The conditionally weighted second moment characterization of the Hessians also shows that
the Hessians of both SVM and logistic regression risk functions emphasize the second moment of the predictors
closer to the optimal separating hyperplane. This emphasis on the region close to the margin echoes previous
results in the non-parametric works of Audibert and Tsybakov (2007) and Steinwart and Scovel (2007) and helps
explain the similarities between SVM and logistic regression classifiers.

The remainder of this paper is organized as follows. Section 2 presents our asymptotic results for
penalized empirical risk minimizers for general loss and penalty functions. Section 3 presents necessary
and sufficient conditions for model selection consistency of $\ell_1$-norm penalized empirical risk minimizers for
general loss functions. Section 4 applies the results in the previous two sections to the study and comparison
of SVM and logistic regression classifiers, and gives conditions under which these classifiers satisfy the
requirements for those results to apply. Section 5 shows a series of simulations providing empirical support
for the model selection consistency theory we developed as well as comparisons between SVM and logistic
regression classifiers. Finally, Section 6 concludes with a brief discussion.

2 Asymptotic distribution of penalized parametric M-estimators

In this section, we present the first main result of this paper (Theorem 4), which characterizes the asymptotic
distribution of penalized empirical risk minimizers for a broad range of penalty and loss functions for a
fixed number of predictors. Theorem 4 extends previous results by Knight and Fu (2000) regarding
$\ell_\gamma$-norm penalized least squares estimates. In essence, the steps in the proof of Theorem 4 closely parallel
the ones used by Knight and Fu (2000), but we keep the study of the convergence of loss and penalty functions
separate so our results can be applied to any combination of loss and penalty functions satisfying the
conditions detailed below.

Before proceeding, we introduce some notation. Our results apply to penalized estimates defined as

$$\hat\theta_n(\lambda_n) := \arg\min_{t \in \Theta} \frac{1}{n} \sum_{i=1}^{n} L(Z_i, t) + \lambda_n \cdot T(t). \qquad (5)$$

The definition in (2) is a particular case that encompasses linear models by setting $Z_i = (Y_i, X_i) \in \mathcal{Y} \times \mathbb{R}^p$,
$t = (a, b)$ and $L(Z_i, t) = L(Y_i, a + b^T X_i)$. In this extended case, the best model we can select is
parameterized by

$$\theta := \arg\min_{t \in \Theta} R(t), \qquad (6)$$

where the risk function has the usual definition

$$R(t) := E_Z\left[ L(Z, t) \right], \quad \text{for } t \in \Theta \subset \mathbb{R}^p. \qquad (7)$$



Let $u \in \mathbb{R}^p$ and $q_n$ be a sequence of non-negative numbers such that $q_n \to \infty$ as $n \to \infty$. Define

$$C_\theta^{(n)}(Z, u) := \sum_{i=1}^{n} \left[ L\left(Z_i, \theta + \frac{u}{q_n}\right) - L(Z_i, \theta) \right],$$

$$G_\theta^{(n)}(u) := n \cdot \left[ T\left(\theta + \frac{u}{q_n}\right) - T(\theta) \right], \quad \text{and}$$

$$V_\theta^{(n)}(Z, \lambda_n, u) := C_\theta^{(n)}(Z, u) + \lambda_n \cdot G_\theta^{(n)}(u).$$

The $V_\theta^{(n)}$ function corresponds to a recentered and rescaled version of the objective function in (5), so that

$$q_n \left( \hat\theta_n(\lambda_n) - \theta \right) = \arg\min_{u} V_\theta^{(n)}(Z, \lambda_n, u).$$

The asymptotic behavior of $\hat\theta_n(\lambda_n)$ can thus be characterized in terms of asymptotic results for $V_\theta^{(n)}$ and its
minimizer. A close study of the proof used by Knight and Fu (2000) shows that, for the most part, the
convergences of the loss ($C_\theta^{(n)}$) and the penalty ($G_\theta^{(n)}$) functions are studied separately. This is reflected in
our Theorem 1, a versatile and important "assembling" tool. Any set of assumptions made on the loss and
penalty functions that ensures the conditions required by Theorem 1 can be used to obtain a characterization
of the distribution of penalized estimates.

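Theorem 1. Let $\lambda_n > 0$ be a sequence of positive (potentially random) real numbers, $Z_i$, $i = 1, \ldots, n$, be
a sequence of i.i.d. realizations from a distribution $P_Z$, $L : \mathcal{Z} \times \Theta \to \mathbb{R}$ be a loss function and $T : \Theta \to \mathbb{R}$
be a penalty function. Let $\hat\theta_n(\lambda_n)$ be as defined in (5).

Suppose there exist functions $C_\theta$, $G_\theta$, a constant $\lambda$, a random vector $W$ and a sequence $q_n$ of
deterministic positive real numbers with $q_n \to \infty$ as $n \to \infty$ such that, for any compact set $K \subset \mathbb{R}^p$:

i) $\sup_{u \in K} \left| \sum_{i=1}^{n} \left[ L\left(Z_i, \theta + \frac{u}{q_n}\right) - L(Z_i, \theta) \right] - C_\theta(W, u) \right| \overset{p}{\to} 0$;

ii) $\sup_{u \in K} \left| \lambda_n \cdot n \cdot \left[ T\left(\theta + \frac{u}{q_n}\right) - T(\theta) \right] - \lambda \cdot G_\theta(u) \right| \overset{p}{\to} 0$;

iii) $\hat\theta_n(\lambda_n) - \theta$ is $O_p(q_n^{-1})$.

Let $V_\theta(W, u) = C_\theta(W, u) + \lambda \cdot G_\theta(u)$.

If i) and ii) hold, then:

a) $\sup_{u \in K} \left| V_\theta^{(n)}(Z, \lambda_n, u) - V_\theta(W, u) \right| \overset{p}{\to} 0$;

If i), ii) and iii) hold, then:

b) $q_n \left( \hat\theta_n(\lambda_n) - \theta \right) \overset{p}{\to} \arg\min_{u} V_\theta(W, u)$.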



Roughly speaking, we can prove Theorem 1 by observing that boundedness in probability of the sequence
$q_n(\hat\theta_n(\lambda_n) - \theta)$ implies that it belongs to some compact set $K$ with probability approaching 1. Given
this condition, it follows that the uniform convergence in probability over compact sets is sufficient to
ensure that the minimizer of $V_\theta^{(n)}$ converges in probability to the minimizer of $V_\theta$. A detailed proof is given in
Appendix A.

Based on Theorem 1, we now proceed to study the loss and penalty functions separately.

2.1 Loss functions 

We now establish sufficient conditions for the loss function to display the convergence required in Theorem
1. Our results use standard approximations for the loss function in terms of the risk function combined with
the Convexity Lemma by Pollard (1991), which is used as a tool to upgrade pointwise convergence results
to uniform convergence over compact sets.

Loss Assumptions (LA)

L1. The parameter $\theta = \arg\min_{t \in \Theta} E[L(Z, t)]$ is bounded and unique;

L2. $E|L(Z, t)| < \infty$ for each $t$;

L3. The loss function $L(Z, t)$ is such that:

a) $L(Z, t)$ is differentiable with respect to $t$ at $t = \theta$ for $P_Z$-almost every $Z$ with derivative
$\nabla_t L(Z, \theta)$ and

$$J(\theta) := E\left[ \nabla_t L(Z, \theta) \nabla_t L(Z, \theta)^T \right] < \infty; \qquad (8)$$

b) the risk function $R(t) = E[L(Z, t)]$ is twice differentiable with respect to $t$ at $t = \theta$ with positive
definite Hessian matrix

$$[H(\theta)]_{i,j} := \left. \frac{\partial^2 E[L(Z, t)]}{\partial t_i \partial t_j} \right|_{t = \theta}; \qquad (9)$$

L4. The loss function $L(Z, t)$ is convex with respect to its argument $t$ for $P_Z$-almost every $Z$.



Assumptions L1-L4 (the L being a mnemonic for the loss function) are relatively mild. The first
assumption on the loss function (L1) ensures that the parameter in (6) is well defined and is thus a minimal
requirement. Assumption L2 yields that a law of large numbers is valid for each value of $t$, and thus that
the risk function equals the pointwise limit of the empirical risk. In our proofs, assumption L3 is used
extensively to obtain local quadratic asymptotic approximations to the risk function around the parameter
$\theta$ that are pointwise valid around $\theta$ (i.e., for each $\theta + u/q_n$ for a sequence $0 < q_n \to \infty$ as $n \to \infty$). The
requirement that the risk function is twice differentiable does not require differentiability of the loss function
itself, as will become evident in our analysis of the hinge loss in Section 4. Finally, assumption L4 is used
to upgrade the local approximation for the risk function from pointwise to uniform over compact sets by
means of Pollard's convexity lemma (Pollard, 1991). Alternative assumptions can replace L4: any set of
conditions yielding uniform convergence over compact sets will do. One could, for instance, replace it by
conditions on the local complexity/entropy of the loss function (see, for instance, Dudley, 1999). We stick
to convexity here given its computational convenience and widespread use in statistics and machine learning
(Bartlett et al.).


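Lemma 2. Under the LA assumptions L1, L2, and L3:

a) There exists a p-dimensional random vector $W \sim N(0, J(\theta))$ such that

$$\sum_{i=1}^{n} \left[ L\left(Z_i, \theta + \frac{u}{\sqrt{n}}\right) - L(Z_i, \theta) \right] - \left[ u^T \cdot H(\theta) \cdot u + W^T \cdot u \right] \overset{p}{\to} 0, \quad \text{for each } u \in \mathbb{R}^p.$$

b) If, in addition, LA assumption L4 holds, then:

b.1) for every compact subset $K \subset \mathbb{R}^p$,

$$\sup_{u \in K} \left| \sum_{i=1}^{n} \left[ L\left(Z_i, \theta + \frac{u}{\sqrt{n}}\right) - L(Z_i, \theta) \right] - \left[ u^T \cdot H(\theta) \cdot u + W^T \cdot u \right] \right| \overset{p}{\to} 0, \quad \text{and}$$

b.2) $\sqrt{n} \cdot \left( \hat\theta_n(0) - \theta \right) = O_p(1)$.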



Our proof of the pointwise convergence (a) and of boundedness of the M-estimator (b.2) is offered in
Appendix A. It can be seen as an extension of the results for the absolute error loss due to Pollard (1991).
The upgrade from pointwise convergence to uniform convergence over compact sets is a direct application
of the Convexity Lemma in Pollard (1991).

2.2 Penalty functions

Lemma 3 establishes conditions for non-adaptive penalties to satisfy the conditions required by Theorem 1.

Penalty Assumptions (PA)

P1. $T : \Theta \to \mathbb{R}$ is non-random and $T(t) \geq 0$ for all $t \in \Theta$;

P2. $T$ is continuous in $t \in \Theta$;

P3. The function

$$G_\theta(u) := \lim_{h \downarrow 0} \frac{T(\theta + u \cdot h) - T(\theta)}{h} \qquad (10)$$

is well defined and continuous for all $u \in \mathbb{R}^p$;

P4. The set $\{t \in \Theta : T(t) \leq c\}$ is compact for all $c < T(0)$.



The set of assumptions P1 through P4 on the penalties (P is a mnemonic for penalty function) is broad
enough to encompass all $\ell_\gamma$ norms with $\gamma > 0$ and the set of generalized elastic net penalties in Friedman
(2008). With minor adjustments, the SCAD penalty (Fan and Li, 2001) can also be treated by our theory.

We emphasize that convexity is not a requirement. Non-randomness and continuity (assumptions P1 and
P2) make it easy to obtain uniform convergence over compact sets. Condition P3 is similar to, but milder than, a
differentiability requirement. We prove that the penalty function converges uniformly over compact sets by
using conditions P1 through P3. Condition P4 is useful in ensuring that the penalized estimates are bounded
in probability. It amounts to a requirement that the penalty function $T$ constrains the penalized estimates to
be within a compact set for all $\lambda > 0$.
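
To make condition P3 concrete, the sketch below evaluates the one-sided directional derivative in (10) for the $\ell_1$ penalty. For $T(t) = \sum_j |t_j|$ the limit has the well-known closed form $\sum_j \left[ u_j \, \mathrm{sign}(\theta_j) 1(\theta_j \neq 0) + |u_j| 1(\theta_j = 0) \right]$, which is continuous in $u$ but not linear in $u$, so P3 holds even though $T$ is not differentiable at sparse $\theta$. The function names and the numerical values are hypothetical and serve only as an illustration.

```python
def l1(t):
    """The l1 penalty T(t) = sum_j |t_j|."""
    return sum(abs(tj) for tj in t)


def g_theta_l1(theta, u):
    """Closed form of the directional derivative (10) for the l1 penalty."""
    total = 0.0
    for th, uj in zip(theta, u):
        if th > 0:
            total += uj        # derivative of |t| at t > 0 is +1
        elif th < 0:
            total -= uj        # derivative of |t| at t < 0 is -1
        else:
            total += abs(uj)   # kink at zero: one-sided derivative is |u_j|
    return total


def g_theta_numeric(T, theta, u, h=1e-6):
    """Finite-difference version of (10): [T(theta + u*h) - T(theta)] / h."""
    shifted = [th + uj * h for th, uj in zip(theta, u)]
    return (T(shifted) - T(theta)) / h


theta = [2.0, -1.0, 0.0]   # hypothetical parameter with one zero entry
u = [0.5, 3.0, -4.0]       # hypothetical direction
closed_form = g_theta_l1(theta, u)
numeric = g_theta_numeric(l1, theta, u)
```

Here the third coordinate of `theta` is zero, so its contribution is $|u_3|$ regardless of the sign of $u_3$; this is the term that produces the kink in the limiting objective.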

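Lemma 3. Let $\theta$ be as defined in (6), $q_n$ be a sequence of non-random positive real numbers satisfying
$q_n \to \infty$, and $\lambda_n$ be a sequence of non-negative (potentially random) real numbers with $\lambda_n \cdot n / q_n \overset{p}{\to} \lambda$ as
$n \to \infty$. Suppose that $T$ is a penalty function satisfying the PA conditions P1 through P3. Then, for all
compact subsets $K \subset \mathbb{R}^p$:

$$\sup_{u \in K} \left| \lambda_n \cdot n \cdot \left[ T\left(\theta + \frac{u}{q_n}\right) - T(\theta) \right] - \lambda \cdot G_\theta(u) \right| \overset{p}{\to} 0, \quad \text{as } n \to \infty.$$
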
A proof of Lemma 3 is offered in Appendix A.

2.3 Convergence of penalized empirical risk minimizers

We now state our first main result, which characterizes the asymptotic distribution of penalized parametric
M-estimators.

Theorem 4. Assume $\lambda_n$ is a sequence of non-negative (potentially random) real numbers such that
$\lambda_n \cdot \sqrt{n} \overset{p}{\to} \lambda \geq 0$ as $n \to \infty$. Let $\theta$, $\hat\theta_n(\lambda_n)$, $J(\theta)$, $H(\theta)$, and $G_\theta(u)$ be as defined in (6), (5), (8), (9), and (10)
respectively. Define:

$$V_\theta(w, u) = u^T \cdot H(\theta) \cdot u + w^T \cdot u + \lambda \cdot G_\theta(u), \quad \text{for } w \in \mathbb{R}^p.$$

If the loss function satisfies the LA assumptions and the penalty function satisfies the PA assumptions, then
there exists a p-dimensional random vector $W \sim N(0, J(\theta))$ such that:

$$\sqrt{n} \left( \hat\theta_n(\lambda_n) - \theta \right) \overset{p}{\to} \arg\min_{u} V_\theta(W, u).$$

Proof of Theorem 4.

In the appendix, we prove that $\sqrt{n}(\hat\theta_n(0) - \theta) = O_p(1)$ implies $\sqrt{n}(\hat\theta_n(\lambda_n) - \theta) = O_p(1)$ for all $\lambda_n \geq 0$ (Lemma 11).
Thus, under the assumptions made, Lemma 2 along with Lemma 11 ensures that conditions (i) and (iii) in
Theorem 1 are satisfied. Additionally, Lemma 3 ensures that condition (ii) in Theorem 1 is met. The result
then follows directly from Theorem 1. □

We emphasize that the approximation afforded by Theorem 4 is valid for the unique minimizer of the
risk function $\theta$ as defined in (6). As the penalty function is not assumed to be convex, local minima may exist
in finite samples. However, the conditions in Theorem 4 ensure that asymptotically the penalty component
of the $V_\theta^{(n)}$ function is negligible in comparison to the risk component and asymptotically the minimizer is
unique.
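
The limiting objective in Theorem 4 can be minimized in closed form in the simplest case. The sketch below assumes a single parameter ($p = 1$), a true value $\theta = 0$ and the $\ell_1$ penalty, so that $G_\theta(u) = |u|$ and $V_\theta(w, u) = h u^2 + w u + \lambda |u|$ with $h = H(\theta) > 0$. Under these assumptions the minimizer is a soft-thresholded version of $-w / (2h)$, so the limit places positive mass at exactly zero whenever $|W| \leq \lambda$ has positive probability. The function names are hypothetical.

```python
def v_scalar(w, u, h, lam):
    """Limiting objective V(w, u) = h*u^2 + w*u + lam*|u| for p = 1, theta = 0."""
    return h * u * u + w * u + lam * abs(u)


def argmin_v_scalar(w, h, lam):
    """Closed-form minimizer of v_scalar over u (requires h > 0, lam >= 0).

    First-order conditions on u > 0 and u < 0 give
    u = -sign(w) * max(|w| - lam, 0) / (2 * h).
    """
    if h <= 0 or lam < 0:
        raise ValueError("need h > 0 and lam >= 0")
    shrunk = max(abs(w) - lam, 0.0)
    sign_w = (w > 0) - (w < 0)
    return -sign_w * shrunk / (2.0 * h)


u_small = argmin_v_scalar(0.5, 1.0, 1.0)   # |w| <= lam: minimizer is exactly zero
u_large = argmin_v_scalar(3.0, 1.0, 1.0)   # |w| > lam: shrunk towards zero
```

With `lam = 0` the function returns the unpenalized minimizer `-w / (2 * h)`, which recovers the usual Gaussian limit for the unpenalized M-estimator in this one-dimensional setting.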

In the next section, we use the asymptotic characterization of the distribution of $\ell_1$-norm penalized
empirical risk minimizers in Theorem 4 to obtain necessary and sufficient conditions for the existence of a
sequence of tuning parameters $\lambda_n$ for which $\hat\theta_n(\lambda_n)$ is model selection consistent.



3 Model selection consistency of $\ell_1$-penalized M-estimators

Our main result concerning $\ell_1$-norm penalized estimates gives necessary and sufficient conditions ensuring
the existence of a sequence of regularization parameters $\lambda_n$ such that $\hat\theta_n(\lambda_n)$ correctly identifies the signs
of the entries in the optimal vector of coefficients as defined in (6) as the sample size increases. Before
we can state this result, we must introduce some notation and terminology. To allow the usual practice of
including non-penalized intercepts in linear models, we write the risk minimizer as $\theta = (\alpha, \beta) \in \mathbb{R}^{1+p}$,
where only the coefficients in $\beta \in \mathbb{R}^p$ are included in the $\ell_1$-penalty. We define a partition of $\beta$ in terms of
its sparsity pattern:

$$\mathcal{A} = \{ j \in \{1, \ldots, p\} : \beta_j \neq 0 \}, \quad \text{and} \quad \mathcal{A}^c = \{ j \in \{1, \ldots, p\} : \beta_j = 0 \}.$$

We let $q$ denote the number of indices in the set $\mathcal{A}$. We will say that an estimate $\hat\theta_n(\lambda)$ is sign-correct if
$\mathrm{sign}(\hat\beta_n(\lambda)) = \mathrm{sign}(\beta)$, where $\mathrm{sign}(t)$ for a vector $t \in \mathbb{R}^p$ is a p-dimensional vector with:

$$[\mathrm{sign}(t)]_j = \begin{cases} 1, & \text{if } t_j > 0, \\ 0, & \text{if } t_j = 0, \text{ and} \\ -1, & \text{if } t_j < 0. \end{cases} \qquad (11)$$

We will say that a sequence of estimates of regularization paths $\hat\beta_n(\cdot) : \mathbb{R} \to \mathbb{R}^p$ is sparsistent and sign
consistent if there exists a sequence $\lambda_n$ of (potentially random) non-negative values of the regularization
parameters such that

$$\lim_{n \to \infty} P\left( \hat\beta_n(\lambda_n) \text{ is sign correct} \right) = 1.$$

We emphasize that the definition requires only the penalized components of $\theta$ to be asymptotically
sign-correct.
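
The definitions above translate directly into code. The following sketch computes the sign vector in (11), the active and inactive index sets, and checks sign-correctness of an estimate. The function names are hypothetical, and indices are 0-based in the code while the text uses 1-based indices.

```python
def sign_vector(t):
    """Componentwise sign as in (11): 1, 0 or -1 for each entry of t."""
    return [(tj > 0) - (tj < 0) for tj in t]


def active_sets(beta):
    """Return (A, A^c): indices of non-zero and zero entries of beta (0-based)."""
    active = [j for j, bj in enumerate(beta) if bj != 0]
    inactive = [j for j, bj in enumerate(beta) if bj == 0]
    return active, inactive


def is_sign_correct(beta_hat, beta):
    """True when the estimate recovers every sign of beta, including exact zeros."""
    return sign_vector(beta_hat) == sign_vector(beta)


beta = [1.5, 0.0, -2.0]                          # hypothetical true coefficients
good = is_sign_correct([0.3, 0.0, -0.1], beta)   # same signs, same zero
bad = is_sign_correct([0.3, 0.01, -0.1], beta)   # a spurious non-zero entry
```

Note that sign-correctness is stricter than getting the signs of the active coefficients right: a single spurious non-zero entry in the inactive set makes the estimate incorrect.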

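For a risk function satisfying assumption L3 above, rearrange and partition the $(1 + q + (p - q)) \times
(1 + q + (p - q))$ Hessian:

$$H(\theta) = \begin{bmatrix} H_{\alpha,\alpha}(\theta) & H_{\alpha,\mathcal{A}}(\theta) & H_{\alpha,\mathcal{A}^c}(\theta) \\ H_{\mathcal{A},\alpha}(\theta) & H_{\mathcal{A},\mathcal{A}}(\theta) & H_{\mathcal{A},\mathcal{A}^c}(\theta) \\ H_{\mathcal{A}^c,\alpha}(\theta) & H_{\mathcal{A}^c,\mathcal{A}}(\theta) & H_{\mathcal{A}^c,\mathcal{A}^c}(\theta) \end{bmatrix}. \qquad (12)$$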



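Theorem 5. Let $\hat\theta_n(\lambda_n) = \left( \hat\alpha_n(\lambda_n), \hat\beta_n(\lambda_n) \right)$ be as defined in (5) above with an $\ell_1$-penalty applied only to
the terms in $\hat\beta_n(\lambda_n)$. Suppose the loss function satisfies the conditions in Assumption Set 1 and define

$$\eta(\theta) := 1 - \left\| H_{\mathcal{A}^c,\mathcal{A}}(\theta) \left[ H_{\mathcal{A},\mathcal{A}}(\theta) - H_{\mathcal{A},\alpha}(\theta) H_{\alpha,\alpha}(\theta)^{-1} H_{\alpha,\mathcal{A}}(\theta) \right]^{-1} \mathrm{sign}(\beta_{\mathcal{A}}) \right\|_\infty > 0.$$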



a ) Let Xn is a sequence of non-negative (potentially random ) real numbers such that such that Xn ■ n 



(13) 



-1 P 



0, and Xn - n 2 ^ > 0/or some 0<c< ^a^n^oo. IfrjiO) > 0, then: 



sign [(3n{Xn) ] = sign{(5) > 1 - exp[-n 



b) Conversely, if $\eta(\theta) < 0$, then, for any sequence of non-negative numbers $\lambda_n$,

$$\lim_{n \to \infty} P\left[ \mathrm{sign}\left( \hat\beta_n(\lambda_n) \right) = \mathrm{sign}(\beta) \right] < 1.$$



The result in Theorem 5 extends the model selection consistency results in Zhao and Yu (2006) concerning
LASSO estimates (based on L2-loss) to more general parametric estimates defined as $\ell_1$-norm penalized
M-estimators based on loss functions satisfying the conditions in Assumption Set 1. We will call the
condition in (13) the generalized irrepresentability condition (GI condition), which in the case of the L2-loss
with zero-mean predictors recovers Zhao and Yu's irrepresentable condition. Accordingly, we call $\eta(\theta)$ the
GI index, which can be interpreted as a measure of incoherence between the active and inactive predictors.
Positive values of $\eta(\theta)$ imply the effects of active and inactive predictors are distinguishable enough so
the $\ell_1$-penalized estimate can correctly identify the signs of all coefficients in the optimal model given a
sufficiently large sample size.



A condition similar to the generalized irrepresentability condition (13) appears in Ravikumar et al.
(2008). There, the GI condition is used to obtain sufficient conditions for the consistent selection of the
terms of an infinite dimensional precision matrix estimate defined as the $\ell_1$-norm penalized minimizer of
the log-likelihood loss for Gaussian distributions. This suggests it is possible to extend Theorem 5 to the
non-parametric setting where the number of regressors $p$ grows with the sample size $n$ (i.e., $p = p_n \to \infty$
as $n \to \infty$). Such an extension will be the subject of future research.

Finally, we would like to emphasize that, even if $\eta(\theta) < 0$, it may be possible to correctly recover the
signs of $\beta$ with a relatively high probability. What the converse in Theorem 5 says is that this probability is
bounded away from 1 in the limit.



3.1 Simplification of the GI condition under linear models and Gaussian predictors 

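Our next result gives sufficient conditions for $\eta(\alpha, \beta)$ to be computable directly from the covariance of
the predictors. This result is limited to $\ell_1$-penalized linear models as defined in (2). Since the loss function
only depends on $X$ through an affine transformation, the Hessian $H(a, b)$ of the risk function $R(a, b)$ as
well as the covariance matrix of scores $J(a, b)$ involve the expected value of an expression involving the
second order cross products in the matrix $Q(X)$ defined as

$$Q(X) := \begin{bmatrix} 1 & X^T \\ X & X X^T \end{bmatrix}. \qquad (14)$$

Theorem 6. Let the coefficients of a linear model $(\alpha, \beta)$ be as defined in (4). If

a) $X \sim N(0, \Sigma)$, and

b) the Hessian of the risk function in (3) can be written in the form

$$H(\alpha, \beta) = E\left[ E\left[ Q(X) \mid \alpha + X^T \beta \right] \cdot w\left( \alpha + X^T \beta \right) \right], \qquad (15)$$

for some function $w : \mathbb{R} \to \mathbb{R}$,

then $\eta(\alpha, \beta) = 1 - \left\| \left[ E\left( X_{\mathcal{A}^c} X_{\mathcal{A}}^T \right) \right] \left[ E\left( X_{\mathcal{A}} X_{\mathcal{A}}^T \right) \right]^{-1} \mathrm{sign}(\beta_{\mathcal{A}}) \right\|_\infty$.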

A proof is given in Appendix A. Theorem 6 tells us that, for zero-mean Gaussian predictors and loss
functions whose Hessian can be expressed as a weighted "average" of second moments of $X$ conditional on
the linear predictor variable $M_{\alpha,\beta}(X) := \alpha + X^T \beta$, the GI condition can be computed directly from the
matrix of second moments $E[X X^T]$. In Section 4, we will see that Theorem 6 holds for linear SVM and
logistic regression classifiers. Besides the particular cases studied in Section 4, we notice that Theorem 6
can find ample use for $\ell_1$-penalized estimates in view of our next result.

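Corollary 7. Suppose that:

a) $X \sim N(0, \Sigma)$,

b) $L(Z, t) = L(Y, a + b^T X)$,

c) $L(Y, a + X^T b)$ is twice differentiable in its second argument for almost every $Y$, and

d) $Y \perp X \mid \alpha + \beta^T X$.

Then, $\eta(\alpha, \beta) = 1 - \left\| \left[ E\left( X_{\mathcal{A}^c} X_{\mathcal{A}}^T \right) \right] \left[ E\left( X_{\mathcal{A}} X_{\mathcal{A}}^T \right) \right]^{-1} \mathrm{sign}(\beta_{\mathcal{A}}) \right\|_\infty$.

Proof. Let $\frac{\partial^2 L(Y, v)}{\partial v^2}$ denote the second derivative of $L$ with respect to its second argument. Since
$Y \perp X \mid \alpha + \beta^T X$, we get

$$H(\alpha, \beta) = E\left[ Q(X) \cdot \frac{\partial^2 L(Y, \alpha + X^T \beta)}{\partial v^2} \right] = E\left[ E\left[ Q(X) \mid \alpha + X^T \beta \right] \cdot E\left[ \frac{\partial^2 L(Y, \alpha + X^T \beta)}{\partial v^2} \,\Big|\, \alpha + X^T \beta \right] \right].$$

Condition (b) in Theorem 6 is thus satisfied with $w(\alpha + \beta^T X) = E\left[ \frac{\partial^2 L(Y, \alpha + X^T \beta)}{\partial v^2} \,\Big|\, \alpha + X^T \beta \right]$. □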



Corollary 7 shows that, if the predictors are Gaussian and the response $Y$ only depends on $X$ through an
affine transform, the conditions for model selection consistency of many Generalized Linear Models (GLMs;
Nelder and Wedderburn, 1972) depend only on the covariance between relevant and irrelevant predictors
even if the model is not correctly specified. For canonical GLMs, condition (d) can be relaxed as the weight
function can be shown not to depend on the response $Y$.

As we will see in the case of the hinge loss, twice differentiability of the loss with respect to its second
argument is not essential. For condition (b) in Theorem 6 to be satisfied, what seems to be essential is that
the loss has the form shown in (1) and that $Y$ is conditionally independent of $X$ given $\alpha + X^T \beta$.
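
Under the conditions of Theorem 6, the GI index only requires the second moment matrix of the predictors and the sign pattern of $\beta$. The sketch below computes it with the standard library only, assuming that the norm in the GI index is the sup-norm, as in Zhao and Yu's irrepresentable condition. The helper names (`solve`, `gi_index`) and the three-predictor example are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - rest) / M[i][i]
    return x


def gi_index(second_moment, beta):
    """1 - || E[X_Ac X_A^T] E[X_A X_A^T]^{-1} sign(beta_A) ||_inf (sup-norm assumed)."""
    active = [j for j, bj in enumerate(beta) if bj != 0]
    inactive = [j for j, bj in enumerate(beta) if bj == 0]
    signs = [1.0 if beta[j] > 0 else -1.0 for j in active]
    s_aa = [[second_moment[i][j] for j in active] for i in active]
    z = solve(s_aa, signs)  # z = E[X_A X_A^T]^{-1} sign(beta_A)
    worst = max(
        (abs(sum(second_moment[i][j] * zj for j, zj in zip(active, z)))
         for i in inactive),
        default=0.0,
    )
    return 1.0 - worst


def example_moments(rho):
    """Two uncorrelated active predictors, one inactive predictor correlated rho with each."""
    return [[1.0, 0.0, rho], [0.0, 1.0, rho], [rho, rho, 1.0]]


beta = [1.0, 2.0, 0.0]
eta_ok = gi_index(example_moments(0.4), beta)    # 1 - 2 * 0.4
eta_bad = gi_index(example_moments(0.6), beta)   # 1 - 2 * 0.6
```

In this example the index equals $1 - 2|\rho|$, so the condition $\eta > 0$ fails once the inactive predictor has correlation above 0.5 with each active predictor, even though each pairwise correlation is well below 1.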

4 Application to SVM and logistic regression classifiers

We now obtain the limiting behavior of some linear classifiers to study the model selection consistency of
their $\ell_1$-penalized estimates. We will use these results along with Theorem 5 to study the model selection
consistency of $\ell_1$-penalized SVM and logistic regression classifiers. The response variable $Y \in \{-1, 1\}$ is
modeled in terms of a linear transformation of a set of predictors $X \in \mathbb{R}^p$. Setting some of the coefficients
in the estimates of the $\beta$ parameter to zero corresponds to eliminating some effects from the model, thus
leading to more interpretable models.

In what follows, we will characterize the asymptotic behavior of the loss functions associated with logistic
regression and support vector machines. Logistic regression is a particular case of Generalized Linear
Models (Nelder and Wedderburn, 1972; McCullagh and Nelder, 1989) and is widely used by statisticians
when modeling the outcome of binomial variables. Support vector machines (Cortes and Vapnik, 1995) are
amply used for obtaining linear classification rules and are based on the hinge loss function. For both
logistic regression and support vector machines, the corresponding loss functions are often interpreted as
convex surrogates for the 0-1 classification loss (Zhang, 2004; Bartlett et al., 2006). Efficient algorithms
exist for obtaining both the $\ell_1$-norm penalized SVM (Zhu et al., 2004) and logistic (Park and Hastie, 2006)
classifiers. Both SVM classification and logistic regression have been used to select relevant predictors
(see, for instance, Guyon et al., 2002; Meier et al., 2006), including in text categorization (see, for instance,
Joachims, 1998; Genkin et al., 2007).



We now set up terminology and notation we will use in connection with the SVM and logistic classifiers
for the remainder of the paper. Given a value for the parameters in the linear classification model
$t = (a, b) \in \mathbb{R}^{1+p}$, a linear classification rule is defined as

$$\hat{Y}(X \mid t) = \mathrm{sign}\left( a + X^T b \right). \qquad (16)$$

The separating hyperplane $\mathcal{H}(t)$ associated with a linear classification rule as in (16) is defined as

$$\mathcal{H}(t) := \{ x \in \mathbb{R}^p : a + x^T b = 0 \}, \quad \text{for } t = (a, b). \qquad (17)$$

The set $\mathcal{H}(t)$ defines the boundary in the predictor space between the points where, for the linear
classification rule based on $t$, the response variable is predicted to be 1 (the set $\{x : \hat{Y}(x \mid t) = 1\} = \{x : a + x^T b > 0\}$)
and the points where $Y$ is predicted to be $-1$ (the set $\{x : \hat{Y}(x \mid t) = -1\} = \{x : a + x^T b < 0\}$). We call
the optimal linear classification rule the classification rule corresponding to setting $t = \theta$, and the estimated
linear classification rule the classification rule formed by setting $t = \hat\theta_n(\lambda_n)$ with $\hat\theta_n(\lambda)$ as defined in (5).
We define the linear predictor variable:

$$M := \alpha + X^T \beta, \qquad (18)$$

which is proportional to the signed distance from the point $X$ to the separating hyperplane defined by the
optimal linear classifier. If the distribution of $Y$ only depends on $X$ through a linear combination, both the
linear SVM and logistic regression are known to recover the optimal Bayes classifier. We also define the true
conditional probability of $Y = 1$ given $X$ as:

$$p(X) = P(Y = 1 \mid X). \qquad (19)$$
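
The classification rule (16) and the linear predictor (18) are simple to compute. The sketch below, with hypothetical function names and toy parameter values, also reports the signed Euclidean distance to the separating hyperplane (17), which equals the linear predictor divided by the Euclidean norm of $b$.

```python
import math


def linear_predictor(x, a, b):
    """The linear predictor a + x^T b, as in (18)."""
    return a + sum(xj * bj for xj, bj in zip(x, b))


def classify(x, a, b):
    """The rule (16): sign of the linear predictor (0 exactly on the hyperplane)."""
    m = linear_predictor(x, a, b)
    return (m > 0) - (m < 0)


def signed_distance(x, a, b):
    """Signed Euclidean distance from x to the hyperplane {x : a + x^T b = 0}."""
    norm_b = math.sqrt(sum(bj * bj for bj in b))
    if norm_b == 0:
        raise ValueError("b must be non-zero to define a hyperplane")
    return linear_predictor(x, a, b) / norm_b


a, b = -1.0, (3.0, 4.0)                      # hypothetical parameters
label = classify((1.0, 1.0), a, b)           # linear predictor is 6
dist = signed_distance((1.0, 1.0), a, b)     # 6 / ||b|| with ||b|| = 5
```

Points with a linear predictor close to zero lie near the separating hyperplane, which is the region emphasized by the Hessians discussed in this section.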

4.1 Regularity conditions and model selection consistency for SVM and logistic classifiers 

Before we can use the results from Section 3 to study and compare the $\ell_1$-penalized SVM and logistic linear
classifiers, we must obtain a set of conditions on the joint distribution of $(X, Y)$ such that the hinge and
logistic regression losses satisfy the requirements on loss functions laid out in Assumption Set 1. Conditions
C1-C3 (C is a mnemonic for classification) give one such set of sufficient conditions in terms of the
marginal distribution of the predictors $X$ and the conditional distribution of $Y$ given $X$.

Classification Assumptions (CA)

C1. $\mathrm{var}[X \mid Y] \in \mathbb{R}^{p \times p}$ is a positive definite matrix for $Y \in \{1, -1\}$,

C2. The distribution of $X$ has a density $f_X(x) > 0$, for all $x \in \mathbb{R}^p$, and

C3. $p(X) \in (0, 1)$ for almost every $X$, that is, for all values $X$ in the support of the distribution of $X$, $Y$
can assume any of its two possible values.

Condition C1 rules out the case of perfectly correlated predictors and is required to ensure uniqueness
of the minimizer $\theta$ as defined in (6). Assumptions C2 and C3 are used to ensure the SVM and logistic
regression loss functions satisfy the assumptions in Lemma 2, but can be relaxed.

The remainder of this section describes linear SVM and logistic classification, shows how Conditions
C1-C3 ensure their corresponding loss functions are amenable to the theory laid out in Section 3, and provides
expressions for the covariance matrix of scores $J(\theta)$ and the Hessian $H(\theta)$ for the risk functions associated
with the linear SVM and logistic regression classifiers.



4.1.1 Logistic Regression 



The canonical logistic regression is one instance of a Generalized Linear Model (Nelder and Wedderburn,
1972) where the probability of Y = 1 is modeled as:

$P(Y = 1 \mid X, a, b) = \frac{\exp(a + X^T b)}{1 + \exp(a + X^T b)}$, (20)

where $a \in \mathbb{R}$ and $b \in \mathbb{R}^p$ are parameters to be determined. The population parameters $\alpha$ and $\beta$ are defined
as the minimizers of the Kullback-Leibler divergence between the true conditional distribution of Y given
X and the Bernoulli distribution with parameter given by (20). The corresponding loss function is:



$L(Y, a + b^T X) = -a \cdot I(Y = 1) - I(Y = 1) \cdot X^T b + \log\left[1 + \exp(a + X^T b)\right]$, (21)

where I(Y = 1) is the indicator of Y = 1. An estimate for $\theta = (\alpha, \beta)$ is obtained by minimizing the
empirical risk with respect to t = (a, b).

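Lemma 8. Suppose that the conditions in Assumption Set 3 hold. Then, the logistic regression loss
function (21) satisfies the conditions in Assumption Set 1 with:

$J(\theta) = E\left[ Q(X) \cdot \left( p(X) - 2 \cdot p(X) \cdot \frac{\exp(\alpha + X^T\beta)}{1 + \exp(\alpha + X^T\beta)} + \left( \frac{\exp(\alpha + X^T\beta)}{1 + \exp(\alpha + X^T\beta)} \right)^2 \right) \right]$, and

$H(\theta) = E\left[ Q(X) \cdot \frac{\exp(\alpha + X^T\beta)}{(1 + \exp(\alpha + X^T\beta))^2} \right]$.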

A proof is given in Appendix A. The expression for the Hessian of the logistic loss can be rewritten as

$H(\theta) = E\left[ E\left[ Q(X) \mid \alpha + X^T\beta \right] \cdot \frac{\exp(\alpha + X^T\beta)}{(1 + \exp(\alpha + X^T\beta))^2} \right]$, (22)

and hence satisfies the conditions of Theorem 6 even if the model is not correctly specified. Indeed, for a
given $\theta$, the Hessian for the logistic risk does not involve the distribution of Y at all.

In addition, equation (22) tells us that the Hessian for the logistic regression risk function is a weighted
average of second moment matrices conditional on the linear predictor variable $\alpha + X^T\beta$. Because
$\frac{\exp(\alpha + X^T\beta)}{(1 + \exp(\alpha + X^T\beta))^2}$
is an even function of the linear predictor variable, the matrices of conditional second moments at predictor
values that are equally distant from the separating hyperplane are equally weighted. In addition, the
highest weight is given to $E\left[ Q(X) \mid \alpha + X^T\beta = 0 \right]$ and the weighting is decreasing in the absolute value
of the linear predictor variable. As a result, as far as asymptotic model selection consistency of
$\ell_1$-norm penalized logistic coefficient estimates is concerned, the correlation structure of the predictors on
regions closer to the separating hyperplane has the most importance, confirming the margin phenomenon
observed earlier in non-parametric works by Audibert and Tsybakov (2007) and Steinwart and Scovel (2007).
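
The shape of the weighting function discussed above can be checked numerically. The following is a minimal sketch: the function name `logistic_weight` is an illustrative choice, not notation from the paper, and the argument m stands for a value of the linear predictor.

```python
import math


def logistic_weight(m):
    """Weight exp(m) / (1 + exp(m))**2 given to E[Q(X) | linear predictor = m]."""
    s = 1.0 / (1.0 + math.exp(-m))  # sigmoid of the linear predictor
    return s * (1.0 - s)            # algebraically equal to exp(m) / (1 + exp(m))**2


if __name__ == "__main__":
    # The weight is symmetric in m and shrinks as |m| grows.
    for m in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(f"m = {m:.1f}: w(m) = {logistic_weight(m):.4f}, "
              f"w(-m) = {logistic_weight(-m):.4f}")
```

The weight peaks at 1/4 on the separating hyperplane (m = 0) and decays symmetrically on both sides, which is the margin effect described in the text.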



4.1.2 The parametric SVM: linear classification with the Hinge loss function 

Classification by means of Support Vector Machines with linear kernel was first introduced in the case where
it is possible to perfectly separate the space of predictors X according to the binomial variable Y. In that
setting, the SVM parameters define a hyper-plane (characterized by the parameters $\alpha$, $\beta$) that maximizes the
gap between the classes:

$(\hat{\alpha}, \hat{\beta}) = \arg\min_{a, b} \|b\|^2$
s.t. $Y_i \cdot (a + X_i^T b) \ge 1$, for all i = 1, ..., n.

To adapt this method to the "no perfect-separation" case, non-negative slack variables $\xi_i$ are introduced and
the optimization problem becomes

$(\hat{\alpha}, \hat{\beta}) = \arg\min_{a, b} \|b\|^2 + C \cdot \sum_{i=1}^n \xi_i$
s.t. $Y_i \cdot (a + X_i^T b) \ge 1 - \xi_i$, for all i = 1, ..., n, and
$\xi_i \ge 0$, for all i = 1, ..., n,

where C is a constant controlling the trade-off between margin maximization and total amount of slack.
The "lack of fit" in SVM is measured by the total distance of the misclassified points to the classification
boundary, represented as the sum of the slack variables. The Euclidean norm acts as a penalization term: in
the perfect separation case it ensures uniqueness of the solution. In a form closer to that of the penalized
M-estimators studied in this paper, the empirical SVM parameter estimates can then be rewritten as:



$(\hat{\alpha}, \hat{\beta}) = \arg\min_{a, b} \sum_{i=1}^n L(Y_i, a + b^T X_i) + \lambda \cdot \|b\|^2$, with

$L(Y_i, a + b^T X_i) = \left[ 1 - Y_i \cdot (a + X_i^T b) \right] \cdot I\left[ 1 - Y_i \cdot (a + X_i^T b) > 0 \right]$, the hinge-loss function.

Here, we will consider the hinge loss on its own, in the spirit of the "assembling" Lemma 1. The next
result establishes that under the conditions of Assumption Set 3, the hinge loss satisfies the assumptions in
Theorem 2.
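
For concreteness, the two loss functions compared in this section can be written as short functions. This is a sketch under stated assumptions: labels are coded as y in {-1, +1}, the logistic loss is written with the usual log(1 + exp(.)) term of the Bernoulli negative log-likelihood, and the names `hinge_loss` and `logistic_loss` are illustrative.

```python
import math


def linear_predictor(a, b, x):
    """Return a + x^T b for an intercept a and coefficient list b."""
    return a + sum(bj * xj for bj, xj in zip(b, x))


def hinge_loss(y, a, b, x):
    """Hinge loss max(0, 1 - y * (a + x^T b)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * linear_predictor(a, b, x))


def logistic_loss(y, a, b, x):
    """Logistic loss -I(y = 1) * (a + x^T b) + log(1 + exp(a + x^T b))."""
    eta = linear_predictor(a, b, x)
    indicator = 1.0 if y == 1 else 0.0
    return -indicator * eta + math.log1p(math.exp(eta))


if __name__ == "__main__":
    x = [2.0]
    print(hinge_loss(1, 0.0, [1.0], x), hinge_loss(-1, 0.0, [1.0], x))
    print(logistic_loss(1, 0.0, [1.0], x), logistic_loss(-1, 0.0, [1.0], x))
```

The hinge loss is exactly zero for points beyond the margin, while the logistic loss is strictly positive everywhere; this difference is what produces the point masses at the elbows in the SVM Hessian.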

Lemma 9. Suppose the conditions in Assumption Set 3 hold. If in addition $\beta \neq 0$, then the hinge loss
function (23) satisfies the conditions in Assumption Set 1 with:

$J(\theta) = E\left[ \left( p(X) \cdot I(1 - \alpha - X^T\beta > 0) + (1 - p(X)) \cdot I(1 + \alpha + X^T\beta > 0) \right) \cdot Q(X) \right]$, and

$H(\theta) = E\left[ \left( p(X) \cdot \delta(1 - \alpha - X^T\beta) + (1 - p(X)) \cdot \delta(1 + \alpha + X^T\beta) \right) \cdot Q(X) \right]$,

where $\delta$ denotes the Dirac delta function.



The expressions for $J(\theta)$ and $H(\theta)$ in Lemma 9 closely parallel results by Koo et al. (2008) concerning
the Bahadur representation of the linear support vector machines. In Appendix A, we present an alternative
proof similar in spirit to the construction by Phillips (1991). In Koo et al. (2008), conditions ensuring $\beta \neq 0$
are also obtained.

Borrowing from the terminology for support vector regression, we call the set where $\alpha + X^T\beta = -1$
the negative "elbow" of the SVM risk. Similarly, the positive "elbow" of the SVM risk is the set where
$\alpha + X^T\beta = 1$. Assuming that Y is independent of X given $\alpha + X^T\beta$, the expression for the Hessian in
Lemma 9 can be rewritten in a more revealing form in terms of conditional expectations at these elbows of
the SVM risk:



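$H(\theta) = E\left[ E\left[ Q(X) \mid \alpha + X^T\beta \right] \cdot P\left( Y = 1 \mid \alpha + X^T\beta \right) \cdot \delta(1 - \alpha - X^T\beta) \right]$
$\quad + E\left[ E\left[ Q(X) \mid \alpha + X^T\beta \right] \cdot P\left( Y = -1 \mid \alpha + X^T\beta \right) \cdot \delta(-1 - \alpha - X^T\beta) \right]$
$= E\left[ Q(X) \mid \alpha + X^T\beta = 1 \right] \cdot P\left( Y = 1 \mid \alpha + X^T\beta = 1 \right) \cdot f(1)$
$\quad + E\left[ Q(X) \mid \alpha + X^T\beta = -1 \right] \cdot P\left( Y = -1 \mid \alpha + X^T\beta = -1 \right) \cdot f(-1)$, (23)

where f denotes the density of the linear predictor variable $\alpha + X^T\beta$. This representation for the Hessian
of the linear SVM risk (expected value of the hinge loss over Y and X) shows that if Y is independent of
X given $\alpha + X^T\beta$ the hinge loss function is amenable to the results in Theorem 6. It also provides many
insights into the behavior of the linear SVM classifier.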

Equation (23) tells us that the Hessian of the SVM risk is a weighted sum of the conditional second
moments of the predictors given that the linear predictor variable $\alpha + X^T\beta$ is at the elbows of the
SVM risk. According to Theorem 5, the generalized irrepresentability condition is not affected if the Hessian
matrix is multiplied by a constant. It follows that, with respect to model selection consistency of
$\ell_1$-norm penalized linear SVM classifiers, the scalar factors $P\left( Y = 1 \mid \alpha + X^T\beta = 1 \right) \cdot f(1)$ and
$P\left( Y = -1 \mid \alpha + X^T\beta = -1 \right) \cdot f(-1)$ only determine the relative importance of the two conditional second
moment matrices, $E\left[ Q(X) \mid \alpha + X^T\beta = 1 \right]$ and $E\left[ Q(X) \mid \alpha + X^T\beta = -1 \right]$, in the composition of
the Hessian. If the two conditional moment matrices happen to be equal, the scalar factors have no bearing
on whether the generalized irrepresentable condition is met or not. If the two conditional moment matrices
are different, the relative importance of the conditional second moments at the two elbows depends
on the density of the linear predictor variable $\alpha + X^T\beta$ and how well defined a class is at each of the elbows.
For example, if $f(1) > f(-1)$ and $P\left( Y = 1 \mid \alpha + X^T\beta = 1 \right) > P\left( Y = -1 \mid \alpha + X^T\beta = -1 \right)$,
the SVM Hessian will be largely determined by the second moment of the predictor at the positive elbow,
$E\left[ Q(X) \mid \alpha + X^T\beta = 1 \right]$, which in turn will have the most influence in determining whether the $\ell_1$-norm
penalized SVM classifier is model selection consistent.
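
The way the two elbows enter the SVM Hessian can be sketched in a few lines. The function below is illustrative only: its name, the nested-list matrix representation and the numerical inputs are assumptions made for this example, not quantities taken from the simulations.

```python
def svm_hessian_from_elbows(q_pos, q_neg, f_pos, f_neg, p_pos, p_neg):
    """Combine the conditional second moment matrices at the two elbows.

    q_pos, q_neg: E[Q(X) | predictor = +1] and E[Q(X) | predictor = -1],
                  given as nested lists of equal shape.
    f_pos, f_neg: density of the linear predictor at +1 and at -1.
    p_pos, p_neg: P(Y = 1 | predictor = +1) and P(Y = -1 | predictor = -1).
    Returns the weighted sum and the normalized weights of the two elbows.
    """
    w_pos = p_pos * f_pos
    w_neg = p_neg * f_neg
    hessian = [
        [w_pos * a + w_neg * b for a, b in zip(row_pos, row_neg)]
        for row_pos, row_neg in zip(q_pos, q_neg)
    ]
    total = w_pos + w_neg
    return hessian, (w_pos / total, w_neg / total)


if __name__ == "__main__":
    q = [[1.0, 0.2], [0.2, 1.0]]
    hessian, weights = svm_hessian_from_elbows(q, q, 0.3, 0.1, 0.9, 0.6)
    print(hessian, weights)
```

When both elbows share the same conditional second moment matrix, the result is that matrix times a scalar, so the scalar factors cannot change whether a scale-invariant condition on the Hessian holds.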

In addition to determining the weighting between the conditional covariances, the density of the predictors
and the probabilities of Y belonging to each class at the positive and negative elbows can inflate or
deflate the covariance matrix of $\hat{\theta}_n$. Standard results concerning parametric M-estimators (see, for
instance, Bickel and Doksum, 2001; Casella and Berger, 2001) yield that
$\lim_{n \to \infty} \mathrm{var}\left[ \sqrt{n} \cdot \hat{\theta}_n \right] = H^{-1}(\theta) J(\theta) H^{-1}(\theta)$.
As a result, the higher the density of the predictors and the easier the separation of the
classes at the elbows, the larger the Hessian and the less variable the coefficients in the SVM classifier.
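
The effect of the size of the Hessian on the asymptotic covariance can be illustrated with a small sandwich computation. This is a sketch restricted to 2 x 2 matrices to stay dependency-free; the helper names and the numbers used are assumptions made for illustration.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(x, y):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def sandwich(h, j):
    """Asymptotic covariance H^{-1} J H^{-1} of an M-estimator."""
    h_inv = inv2(h)
    return matmul2(matmul2(h_inv, j), h_inv)


if __name__ == "__main__":
    h = [[2.0, 0.0], [0.0, 4.0]]
    j = [[1.0, 0.0], [0.0, 1.0]]
    print(sandwich(h, j))
    # Tripling H while keeping J fixed divides the covariance by nine.
    print(sandwich([[3.0 * v for v in row] for row in h], j))
```

Scaling H by a constant c while J stays fixed scales the covariance by 1/c^2, which is the sense in which a larger Hessian yields less variable coefficients.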



5 Simulations 

We now present a series of simulation results which give empirical evidence supporting the theory for model
selection consistency for $\ell_1$-penalized linear SVM and logistic regression classifiers. In addition, we use the
simulations to compare the model selection performance of $\ell_1$-penalized linear SVM and logistic regression
classifiers asymptotically and in finite samples. To avoid a simulation set-up that is biased in favor of either
linear SVMs or logistic regression, we base our conclusions on randomly selected joint distributions for
(Y, X), where Y is the binomial response variable and X is the predictor. We start off by detailing how the
designs used throughout this section are sampled.

5.1 Randomly constructing joint distributions (Y, X) 

Throughout our simulation experiments, we will call a design the joint distribution of (Y, X) characterized
by the parameters of the conditional distribution of Y given X and the distribution of the predictors X.

The conditional distribution of the binomial random variable $Y \in \{-1, 1\}$ given $X \in \mathbb{R}^p$ is characterized
by a probability profile function $g : \mathbb{R} \to (0, 1)$, an intercept $\zeta$ and a normal direction to the separating
hyperplane $\nu \in \mathbb{R}^p$. Given these elements, we set $P(Y = 1 \mid X) = g(\zeta + X^T\nu)$, so Y is independent of X
given any one-to-one transformation of $\zeta + X^T\nu$, in particular $\alpha + X^T\beta$. In all designs, we set $\zeta = 0$. Given
a number of non-zero terms q, we partition the normal direction according to $\nu = (\nu_A, \nu_{A^c}) \in \mathbb{R}^q \times \mathbb{R}^{p-q}$.
The non-zero component of the normal direction to the separating hyper-plane, $\nu_A$, is sampled uniformly on
the unit sphere in $\mathbb{R}^q$. One problem with this sampling scheme is that it may result in tiny coefficients which
are hard to detect in finite samples, thus complicating the comparison between asymptotic and experimental
results. To avoid such tiny coefficients, we discard directions having $\max_{1 \le j \le q} |\nu_j| / \min_{1 \le j \le q} |\nu_j| > 5$.
To provide stronger evidence in favor of Theorem 5, we will consider two different probability profile
functions g:

the logistic function, $g_1(r) := \frac{e^r}{1 + e^r}$, and



the "blip" function, g2{'i^) ■= i ( 1 + r • exp 



1— r 



The logistic function ($g_1$) is the canonical link for Bernoulli GLM models. The "blip" function ($g_2$) concentrates
all the action close to the separation boundary between the classes and is thus expected to favor SVM
classifiers.
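
The sampling of the non-zero part of the normal direction, including the rejection of directions with tiny coefficients, can be sketched as follows. The function name, the use of normalized Gaussian draws to obtain a uniform point on the sphere, and the `rng` argument are choices made for this illustration rather than details reported in the paper.

```python
import math
import random


def sample_direction(q, max_ratio=5.0, rng=random):
    """Draw a unit vector in R^q uniformly on the sphere, rejecting draws
    whose largest and smallest absolute coordinates differ by more than
    a factor of max_ratio."""
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in range(q)]
        norm = math.sqrt(sum(v * v for v in z))
        if norm == 0.0:
            continue  # degenerate draw, try again
        nu = [v / norm for v in z]
        magnitudes = [abs(v) for v in nu]
        # Written as a product to avoid dividing by a zero coordinate.
        if max(magnitudes) <= max_ratio * min(magnitudes):
            return nu


if __name__ == "__main__":
    print(sample_direction(4, rng=random.Random(0)))
```

Normalizing a vector of independent standard Gaussian draws gives a uniform point on the unit sphere, so the loop amounts to rejection sampling from the uniform distribution restricted to the accepted region.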

For the distribution of the predictors, we consider two families of distributions: Gaussian and mixture
of Gaussian distributions. For the Gaussian predictors, the mean is fixed at 0 and a covariance matrix
$\Sigma \in \mathbb{R}^{p \times p}$ is sampled as follows. First, $\tilde{\Sigma} \in \mathbb{R}^{p \times p}$ is sampled from a Wishart($I_p$, p, p) distribution, where $I_p$ is
the identity matrix. Then, $\tilde{\Sigma}$ is normalized to have unit diagonal and $\Sigma = \gamma \cdot \tilde{\Sigma}$ with the scalar $\gamma > 0$
chosen so that $\nu^T \Sigma \nu = \sigma^2$, where $\sigma^2$ is a parameter controlling the variance of X. The mixed Gaussian

predictors are a mixture of two Gaussian distributions with equal proportions, common variance S and 
symmetric means ^ and —fi. The parameter ^ is randomly selected as /i = | • /i • o", where /i = |x| • + w, 
with X ~ -^(0, 1) and w ~ A^(0, Ip). The common variance matrix of the components of the mixture of 
Gaussian is sampled similarly as the covariance matrix for the Gaussian case, with the difference that 7 is 
chosen so u^Hu = ^ ■ a"^ . The factors | for fj, and ^ for S are used to ensure that the contribution of the 
mean and variance for the second moment EXX^ = ^/x^ + S is somewhat balanced. 
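
The covariance sampling step for the Gaussian case can be sketched as below. Several points are assumptions of this illustration and not confirmed by the text: the Wishart draw is taken to be G G^T for a p x p matrix G of independent standard normal entries, and the scaling is taken to make the variance of the linear combination of X along the normal direction equal to a target value `sigma2`.

```python
import math
import random


def sample_covariance(nu, sigma2, rng):
    """Sample a covariance matrix with constant diagonal, scaled so that
    nu^T Sigma nu equals sigma2 (assumed normalization)."""
    p = len(nu)
    g = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(p)]
    # Wishart-type draw: S = G G^T with p degrees of freedom, identity scale.
    s = [[sum(g[i][k] * g[j][k] for k in range(p)) for j in range(p)]
         for i in range(p)]
    # Normalize to unit diagonal (a correlation matrix).
    d = [math.sqrt(s[i][i]) for i in range(p)]
    corr = [[s[i][j] / (d[i] * d[j]) for j in range(p)] for i in range(p)]
    # Rescale so the variance along nu matches sigma2.
    quad = sum(nu[i] * corr[i][j] * nu[j] for i in range(p) for j in range(p))
    gamma = sigma2 / quad
    return [[gamma * corr[i][j] for j in range(p)] for i in range(p)]


if __name__ == "__main__":
    nu = [0.6, 0.8, 0.0]
    sigma = sample_covariance(nu, 2.0, random.Random(1))
    for row in sigma:
        print(row)
```

After the rescaling, every diagonal entry equals the same scalar and the quadratic form along the normal direction matches the target, so designs with different sampled correlation structures share the same signal scale.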

To obtain the population parameter $\theta = (\alpha, \beta)$ for each of the sample designs, we
first notice that the probability profile functions satisfy $g_j(z) = 1 - g_j(-z)$, for j = 1, 2, $z \in \mathbb{R}$, and
the distributions of the predictors are symmetric about zero. It thus follows that the optimization problem
defining $\theta$ is symmetric about $0 \in \mathbb{R}^p$ and we have $\alpha = 0$ for all designs. Then, because $P(Y = 1 \mid X)$
only depends on X through $X^T\nu$, $\beta$ has the form $\beta = c^* \cdot \nu$, for some scalar $c^* \in \mathbb{R}$. The value of $c^*$
that minimizes the risk is obtained by numerically minimizing the average of the risk function conditional
on X for a large sample (10^) from the predictor distribution. For any given design, the value of c* differs 
depending on the risk function being used. 
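
The numerical search for the scaling constant can be sketched for one simple case. The assumptions here are: the logistic profile, the logistic risk, a zero intercept, and a linear predictor along the normal direction that is standard Gaussian; the sample size, the search interval and the golden-section routine are illustrative choices and not the procedure reported in the paper.

```python
import math
import random


def conditional_logistic_risk(c, m):
    """Logistic risk given the linear predictor m when P(Y = 1 | m) = sigmoid(m)
    and the classifier uses c * m."""
    p = 1.0 / (1.0 + math.exp(-m))
    return p * math.log1p(math.exp(-c * m)) + (1.0 - p) * math.log1p(math.exp(c * m))


def average_risk(c, predictors):
    """Average of the conditional risk over a sample of linear predictors."""
    return sum(conditional_logistic_risk(c, m) for m in predictors) / len(predictors)


def golden_section_min(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    c_star = golden_section_min(lambda c: average_risk(c, sample), 0.0, 10.0)
    print(c_star)
```

In this correctly specified case the minimizer is c = 1, because the derivative of the averaged risk vanishes there for any sample; for the hinge risk or the "blip" profile the same search would return a different constant.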

5.2 Model selection consistency and the GI condition for linear classifiers 

We now provide empirical evidence of the validity of Theorem 5 for $\ell_1$-norm penalized linear SVM and
logistic regression classifiers. According to Theorem 5, the proportion of paths containing sign-correct
estimates should approach 1 as $n \to \infty$ if the GI index $\eta(\theta)$ is positive.

To estimate the probability that a sample regularization path contains a sign-consistent estimate for a
given design, we used 50 replicates of the regularization path by sampling from the joint distribution of (Y, X)
and computing the regularization path for $\ell_1$-norm penalized linear SVM (Li and Zhu, 2008) and logistic
regression (Park and Hastie, 2006). To compute the GI index for a given design, we can use the expressions in
equations (22) and (23) in conjunction with the expressions for the conditional second moments of Gaussian
and mixed Gaussian random variables shown in Appendix B.

Figures 1 and 2 show plots of the proportion of sample regularization paths containing a sign-correct
solution against the GI index $\eta(\theta)$ under various conditions. In all cases considered, the proportion of times
the $\ell_1$-penalized classifier contains a sign-correct estimate in its regularization path increases as n increases
if $\eta(\theta) > 0$.

Figures 1 and 2 also show that, in most cases, correct recovery of the signs of $\theta$ is harder if $\eta(\theta) < 0$.
One notable exception occurs for mixed Gaussian predictors under the "blip" conditional probability profile.
In that case, it is possible to have a high probability of correct sign recovery even under $\eta(\theta) < 0$. Notice
that this result does not contradict Theorem 5. Even though the probability that the signs will not
be recovered correctly is never zero if $\eta(\theta) < 0$, it can be quite small. A more careful analysis of the
probability of correct sign recovery must take into account the variance of the estimates $\hat{\beta}_j(\lambda_n)$ with
indices in $A^c = \{j \in 1, \ldots, p : \beta_j = 0\}$.

It is also possible to notice that, given the asymptotic nature of the results, the probability of correct sign
recovery can still be small for smaller sample sizes n and for larger numbers of predictors p, especially under
a fainter signal ("blip" conditional probability profile). The extension of the theory in Section 3 to the
non-parametric case $p = p_n \to \infty$ can potentially offer more precise answers on how the total number of
predictors affects the chance that the regularization path contains a sign-correct model.

5.3 Comparison of $\ell_1$-penalized SVM and logistic regression classifiers

In addition to allowing us to study the model selection consistency of SVM and logistic classifiers, Theorem
5 along with Lemmas 8 and 9 lets us shed some light onto a question often asked by practitioners:
which of SVM and logistic regression classifiers should be used for variable selection? Our theoretical and
experimental results suggest that, if variable selection is made through $\ell_1$-penalization, the answer depends
critically on the sample size available.

5.3.1 Large sample (asymptotic) comparison 

If a large enough sample size is available, Theorem 6 suggests that in terms of variable selection by means
of $\ell_1$-norm penalized estimates logistic and SVM are equally likely to be model selection consistent for
the Gaussian designs sampled as described in Section 5.1. For non-Gaussian predictors, a comparison of the GI indices $\eta(\theta)$
shows that model selection consistency can be theoretically guaranteed for logistic regression classifiers in
more designs than SVM. The results are shown in Figure 3 and Table 1. Interestingly, for the distribution of
designs considered, logistic was more likely to be model selection consistent even under the "blip" conditional
probability profile function, thought to favor SVM by concentrating most of the class discrimination
information on a band around the optimal separating hyperplane.

5.3.2 Finite sample comparison

Figure 4 shows a comparison of the proportion of times the $\ell_1$-penalized logistic and SVM regularization
paths contained a model with correctly selected variables. In each plot, each point is obtained by plotting the
proportion of paths containing a sign-correct model for logistic (vertical axis) against the same proportion
for SVM for a given design. Thus, the closer a point sits to the lower right corner, the better was the
performance of SVM in comparison to logistic for that specific design. The proportions are obtained from
50 replications of one of the designs sampled as described in Section 5.1.

Figure 1: Proportion of sample regularization paths containing a sign-correct model vs. GI index $\eta(\theta)$ under
Gaussian predictors: The proportion at each point is based on 50 replicates of the sample regularization path
for the corresponding design. The results displayed in these panels show good agreement with the theory for
sign consistency of general $\ell_1$-penalized M-estimators developed in Section 3: the
proportion of paths containing a sign-correct model approaches one as the sample size increases whenever $\eta(\theta) > 0$.
For $\eta(\theta) < 0$, the chances of correct sign recovery are low throughout. Not surprisingly, the asymptotic
approximation works better for smaller p. Also notice that the fainter signal of the "blip" profile makes the
recovery of the correct signs harder.

Figure 2: Proportion of sample paths containing a sign-correct model vs. GI index for mixed Gaussian
predictors: The proportion at each point is based on 50 replicates of the sample regularization path for the
corresponding design. As in Figure 1, the results displayed in these panels show good agreement with the theory
for sign consistency of general $\ell_1$-penalized M-estimators developed in Section 3:
the proportion of paths containing a sign-correct model approaches one as the sample size increases whenever
$\eta(\theta) > 0$. It is interesting to notice that a high probability of correct sign recovery is possible even if $\eta(\theta) < 0$
(see the SVM estimates under the "blip" profile) but it does not approach one asymptotically. Also notice that the
fainter signal of the "blip" profile makes the recovery of the correct signs harder when $\eta(\theta) > 0$, especially for
the logistic classifiers.

The results shown in Figure 4 suggest that the comparison of logistic and SVM classification based solely
on the GI condition should be taken with a grain of salt. Two factors are involved here: first, the results are
based on asymptotic approximations and, second, a negative GI index does not necessarily imply a low
probability of correct sign recovery (though such probability is known not to approach one). While in most
cases the two methods are comparable in their ability to contain a correct model in their regularization path,
SVM does seem to have some advantage over logistic under Gaussian predictors and the "blip" conditional
profile even at large sample sizes (n = 1,000). For smaller sample sizes (n = 100), SVM did perform
markedly better than logistic regression under mixed Gaussian predictors and the logistic profile.



6 Discussion and concluding remarks 



In this paper, we have extended the asymptotic characterization of the distribution of LASSO estimates
($\ell_1$-penalized least squares) given by Knight and Fu (2000) to more general loss and penalty functions in the
parametric case. The key to our extension consists of finding conditions under which it is possible to obtain
a local quadratic approximation that is uniformly valid on a neighborhood of the risk minimizer. Given the
widespread use of convex loss functions in the literature, the Convexity Lemma by Pollard (1991) was
our tool of choice. As we restrict attention to the parametric case, we have been able to keep the study of
loss and penalty functions separate. To the extent possible, we have stated our results in a modular fashion so
they can be applied to various combinations of loss and penalty functions.

We have used the asymptotic characterization of the distribution of $\ell_1$-penalized parametric M-estimates
to obtain sufficient conditions ensuring the existence of a model selection consistent estimate for some
appropriate value of the regularization parameter. Interestingly, the condition involves the Hessian but not
the variance of the score function evaluated at the risk minimizer. Ravikumar et al. (2008) have obtained
a similar condition in the non-parametric case ($p \gg n$) for the penalized maximum likelihood estimate of
Gaussian covariance matrices. That suggests the results we present in this paper can be extended to the
non-parametric setting under appropriate conditions, which will be the theme of future research. We also
show (Theorem 6) that, under appropriate assumptions, the condition for sign-consistency of $\ell_1$-penalized
parametric M-estimates can be expressed solely in terms of the matrix of second moments of the predictors.
Our simulations provide ample empirical evidence for the theory we have presented in the context of

Figure 3: Logistic GI vs. SVM GI for 500 designs: The shaded area shows where logistic regression is model
selection consistent and SVM is not. Under Gaussian predictors (the four leftmost panels), the GI indices are
exactly the same for the SVM and logistic classifiers, as expected in view of Theorem 6 and Lemmas 8 and 9.
For mixed Gaussian predictors, the logistic regression classifier is model selection consistent slightly more often
than SVM under the logistic conditional probability profile and, surprisingly, much more often under the "blip"
design. Recall, however, that SVM was shown to have high probability of correct sign recovery even in cases
with $\eta(\theta) < 0$.

Figure 4: Comparison of the proportion of sample paths containing sign correct estimates in finite samples:
The GI condition (Theorem 5) concerns an asymptotic guarantee and does not ensure the probability of correct
sign recovery to be low if $\eta(\theta) < 0$. In these plots, we compare SVM and logistic classifiers in terms of probability
of correct sign recovery in finite samples. The SVM classifier seems to perform better in terms of the probability
of correct sign recovery under Gaussian predictors and the "blip" conditional probability profile. The SVM
classifier also performs better in smaller sample sizes under mixed Gaussian predictors and the logistic conditional
probability profile.

Upper panels:

                     SVM not MSC   SVM MSC   |                      SVM not MSC   SVM MSC
  Logistic not MSC      31.8%        6.0%    |   Logistic not MSC      15.8%        4.8%
  Logistic MSC           7.0%       55.2%    |   Logistic MSC           5.4%       74.0%

Lower panels:

                     SVM not MSC   SVM MSC   |                      SVM not MSC   SVM MSC
  Logistic not MSC      31.8%        5.2%    |   Logistic not MSC      16.6%        3.6%
  Logistic MSC          26.0%       37.0%    |   Logistic MSC          32.2%       47.6%

Table 1: Frequency at which SVM and logistic are model selection consistent: Each table shows the
proportion out of 500 designs with mixed Gaussian predictors in which the $\ell_1$-norm penalized SVM and logistic
classifiers are model selection consistent (MSC). For most designs, both SVM and logistic would asymptotically
contain estimates with all signs correct in their regularization paths. Among the cases where only one of the two
classifiers would asymptotically contain a sign correct estimate in its path, the logistic classifier would be the
correct one in most cases.

SVM and logistic regression classification. For Gaussian predictors and a given design, one of two things
can happen: both the logistic regression and the linear SVM classifiers are sparsistent and sign-consistent or
neither of them is. In finite samples, SVM seems to enjoy a slight advantage in picking the correct signs
in the cases we simulated. For a set of randomly selected designs with non-Gaussian predictors, logistic
regression classifiers were sparsistent and sign-consistent more frequently than SVM classifiers. In finite
samples, however, the evidence in favor of either SVM or logistic regression classifiers was mixed.

A Proofs of theoretical results
A.1 Proof of results in Section 2

We now prove the results in Section 2. Before that, we prove the technical Lemma 10, which is used
in the proof of Theorem 4.

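Lemma 10. Define:

$V_n(F_n, \lambda_n, u) := \sum_{i=1}^n \left[ L\left( Z_i, \theta + \frac{u}{q_n} \right) - L(Z_i, \theta) \right] + \lambda_n \cdot \left[ T\left( \theta + \frac{u}{q_n} \right) - T(\theta) \right]$.

Then:

$q_n \cdot (\hat{\theta}_n(\lambda_n) - \theta) = \arg\min_{u \in \mathbb{R}^p} V_n(F_n, \lambda_n, u)$. (A-1)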



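Proof of Lemma 10. From the definition of $\hat{\theta}_n(\lambda_n)$, we know that:

$\hat{\theta}_n(\lambda_n) = \arg\min_{t \in \Theta} \left[ \sum_{i=1}^n L(Z_i, t) + \lambda_n \cdot T(t) \right]$
$= \arg\min_{t \in \Theta} \left[ \sum_{i=1}^n \left( L\left( Z_i, \theta + \frac{q_n (t - \theta)}{q_n} \right) - L(Z_i, \theta) \right) + \lambda_n \cdot \left( T\left( \theta + \frac{q_n (t - \theta)}{q_n} \right) - T(\theta) \right) \right]$.

The result follows from making a variable transformation $u(t) = q_n \cdot (t - \theta)$ and letting $\hat{u} = u(\hat{\theta}_n(\lambda_n))$. □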



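Proof of Theorem 1. a) The conclusion in (a) follows easily from the triangle inequality, since for each
compact set K:

$\sup_{u \in K} \left| V_n(F_n, \lambda_n, u) - V_\theta(W, u) \right| \le \sup_{u \in K} \left| \sum_{i=1}^n \left[ L\left( Z_i, \theta + \frac{u}{q_n} \right) - L(Z_i, \theta) \right] - C_\theta(W, u) \right|$
$\quad + \sup_{u \in K} \left| \lambda_n \cdot \left[ T\left( \theta + \frac{u}{q_n} \right) - T(\theta) \right] - \lambda \cdot G_\theta(u) \right|$.

b) Define:

$\hat{u}_n = q_n \cdot (\hat{\theta}_n(\lambda_n) - \theta)$,
$\hat{u} = \arg\min_u \left[ C_\theta(W, u) + \lambda \cdot G_\theta(u) \right]$.

For any compact set $K_\varepsilon$ and each n, we know:

$P(|\hat{u}_n - \hat{u}| > \delta) = P(|\hat{u}_n - \hat{u}| > \delta \mid \hat{u}_n \in K_\varepsilon) \cdot P(\hat{u}_n \in K_\varepsilon) + P(|\hat{u}_n - \hat{u}| > \delta \mid \hat{u}_n \notin K_\varepsilon) \cdot P(\hat{u}_n \notin K_\varepsilon)$
$\le P(|\hat{u}_n - \hat{u}| > \delta \mid \hat{u}_n \in K_\varepsilon) + P(\hat{u}_n \notin K_\varepsilon)$.

Since $\hat{u}_n$ is $O_p(1)$, there exists a compact set $K_\varepsilon$ such that $\lim_{n \to \infty} P(\hat{u}_n \notin K_\varepsilon) = 0$ and thus:

$\lim_{n \to \infty} P(|\hat{u}_n - \hat{u}| > \delta \mid \hat{u}_n \notin K_\varepsilon) \cdot P(\hat{u}_n \notin K_\varepsilon) \le \lim_{n \to \infty} P(\hat{u}_n \notin K_\varepsilon) = 0$.

To show that the remaining term vanishes, the uniform convergence over compact sets gives that:

$\lim_{n \to \infty} P(|\hat{u}_n - \hat{u}| > \delta \mid \hat{u}_n \in K_\varepsilon) = 0$,

which concludes the proof. □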

Proof of Lemma 2. Proof of a) pointwise convergence: To establish pointwise convergence, define:

$\delta_{i,n}(u) = L\left( Z_i, \theta + \frac{u}{\sqrt{n}} \right) - L(Z_i, \theta)$,
$B_n(u) = \sum_{i=1}^n \delta_{i,n}(u)$. (A-2)



In terms of these definitions, we have:



1 " 

-E 

n ^ 

i=l 



L Zi,e + 



I," -I ^ I — L {Zi,6) 
n 



Bn{u) = n- 

i=l 

Now, because 9 is optimal we have EDj = and, thus: 

n 
i=l 

Adding $E[B_n(u)]$ to and subtracting $\sum_{i=1}^n E[R_{i,n}(u)]$ from the right hand side of (A-3):

$B_n(u) = E[B_n(u)] + W_n^T u + \sum_{i=1}^n \left[ R_{i,n}(u) - E[R_{i,n}(u)] \right]$. (A-3)

Pointwise convergence for each u follows from obtaining a quadratic approximation to $E[B_n(u)]$, a
weak convergence for $W_n^T u$ and proving that the last term is $o_p(1)$. These facts are established next.
i) Quadratic approximation to $E[B_n(u)]$:



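First notice that this term is just n times the difference of the risk function evaluated at $\theta + \frac{u}{\sqrt{n}}$ and $\theta$:

$E[B_n(u)] = n \cdot E\left[ \frac{1}{n} \sum_{i=1}^n \left( L\left( Z_i, \theta + \frac{u}{\sqrt{n}} \right) - L(Z_i, \theta) \right) \right]$
$= n \cdot \left( E\left[ L\left( Z, \theta + \frac{u}{\sqrt{n}} \right) \right] - E[L(Z, \theta)] \right)$
$= n \cdot \left[ R\left( \theta + \frac{u}{\sqrt{n}} \right) - R(\theta) \right]$.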



Since the risk function R is twice differentiable (L2.c) and $\theta$ is optimal, the gradient of the risk with
respect to its argument t must be zero at $t = \theta$. In addition, for $H(\theta)$ as defined in assumption L2.c, we can
write the approximation:

$E[B_n(u)] = n \cdot \left[ \frac{u^T}{\sqrt{n}} \cdot H(\theta) \cdot \frac{u}{\sqrt{n}} + o\left( \frac{|u|^2}{n} \right) \right] = u^T \cdot H(\theta) \cdot u + o(1)$.



ii) Weak convergence of $W_n^T u$:

Optimality of $\theta$ and differentiability of the risk function imply that $E[D_i] = 0$, thus $E[W_n] = 0$.

Since $\nabla_b L(Z, b)$ exists almost everywhere, all terms in the summation defining $W_n$ almost surely exist.
Since the terms in the summation are i.i.d. and each has finite variance, the Central Limit Theorem applies and we
can conclude that:

$W_n \to_d N(0, J(\theta))$, with $J(\theta) = E\left[ \nabla_t L(Z, \theta) \nabla_t L(Z, \theta)^T \right]$.

iii) $\sum_{i=1}^n \left[ R_{i,n}(u) - E[R_{i,n}(u)] \right]$ is $o_p(1)$:

Let $\xi_i = R_{i,n}(u) - E[R_{i,n}(u)]$. Since convergence in quadratic mean implies convergence in probability,
it is enough to prove that $E\left( \sum_{i=1}^n \xi_i \right)^2 = o(1)$.

Clearly $E\,\xi_i = 0$ for all i. That, along with independence across the observed samples, yields:



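$E\left( \sum_{i=1}^n \xi_i \right)^2 = \mathrm{var}\left[ \sum_{i=1}^n \xi_i \right] = \sum_{i=1}^n \mathrm{var}(\xi_i) = \sum_{i=1}^n E\,\xi_i^2 \le \sum_{i=1}^n E\left[ R_{i,n}(u)^2 \right]$.

Because $L(Z_i, t)$ is differentiable at $t = \theta$ for almost every $Z_i$, we have that: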



i^i,n(w)r 



L{z,,e + 



L{z,,e)-VbL{z,,e)- 



n 



for almost all Z, . 



We conclude that E 



Proof of b.1) uniform convergence over compact sets:

Uniform convergence of $B_n(u)$ over compact sets follows from the pointwise convergence just proven
and the Convexity Lemma due to Pollard (1991).

Proof of b.2) boundedness of $\sqrt{n} \cdot \hat{\theta}_n(0)$: Our proof of $\sqrt{n}$-boundedness of the un-penalized estimate is an
adaptation of an argument due to Pollard (1991). As a first step, we "complete the squares" in the quadratic
approximation by letting C be a decomposition of the (non-singular) Hessian matrix, i.e. $C^T C = H(\theta)$.
We then write $B_n(u)$ as:



Bju) 



1 



Let $A_n$ denote the ball with center $-\frac{1}{2} C^{-1} (C^{-1})^T W_n$ and radius $\delta > 0$. Since $W_n$ converges in distribution,
it is stochastically bounded and, hence, a compact set $K^*$ with probability arbitrarily close to one can be
chosen to contain $A_n$. Thus:

$\Delta_n := \sup_{u \in A_n} |r_n(u)| \to_p 0$.

We now study the behavior of $B_n$ outside of $A_n$ to conclude that $\hat{\theta}_n(0)$ is consistent. To do that, let z be a
point outside the ball and define:


m 



Z-^{C-YWn 

m 



Because of convexity, we have that for u* = —^C {C ) W„ + 6 ■ v on the boundary of the An ball: 



m 



Bn{z) + {l--]Bn{liC 



-1\T 



> Bju* 



> inf {v^H{e)v) - -Wl [H{e)]-^ W„ - A, 



\v\<l 



1, 



> 6\mi {v^H{e)v) - -Wl [H-\6)] W„ - A, 

|i;|<l 4 

> S^Ap{H{9)) - Bn (^-^C-\C~YWn^ - 2A, 
where Ap{H{6)) is the smallest eigenvalue of H{9). We then conclude that: 

inf Bn{u) > Bni-liC-yWn) + ^ [6^Ap{Hie)) - 2An] . 



Since $\Delta_n \to_p 0$, we have with probability approaching one that $|\hat{u} + \frac{1}{2} C^{-1} (C^{-1})^T W_n| < \delta$ and the result
follows from recalling that $\hat{u} = \sqrt{n} \cdot (\hat{\theta}_n(0) - \theta)$. □

Proof of Lemma 3. We first break the problem into two easier to handle pieces:



sup 

u&KcRP 



u 



T{9 + —)- T{9) 



X-Geiu] 



< sup 

u£KcM.P 



+ sup 



A 



T{9 + u-q-^)-T{e) 



qn 



T{e + u.q-^)-T{e) 



.-1 



Ge{u] 



Since T is continuous, for the compact set K (ZW there exists < Mk < oo such that: 



sup 



A. 



A 



T{e + u-q-')-T{e) 



qn 





< 











Mk 0, as n ^ oo. 



For the second term, we know from condition P4 in Assumption Set 2:

\[ \lim_{n \to \infty} \frac{T(\theta + u \cdot q_n^{-1}) - T(\theta)}{q_n^{-1}} = \lim_{h \downarrow 0} \frac{T(\theta + h \cdot u) - T(\theta)}{h} = G_\theta(u). \]

Because $G_\theta$ is assumed continuous, the pointwise convergence can be strengthened to uniform convergence
over compact sets. □
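
For the $\ell_1$ penalty $T(t) = \|t\|_1$, the one-sided directional derivative in condition P4 has a well-known closed form: $G_\theta(u) = \sum_{j: \theta_j \neq 0} \mathrm{sign}(\theta_j) u_j + \sum_{j: \theta_j = 0} |u_j|$. The following sketch (function names are illustrative, not taken from the paper) compares this closed form with the difference quotient appearing in the limit above.

```python
def l1_norm(t):
    """The l1 penalty T(t) = sum_j |t_j|."""
    return sum(abs(x) for x in t)

def l1_directional_derivative(theta, u):
    """Closed form of lim_{h -> 0+} [T(theta + h u) - T(theta)] / h for the l1 norm."""
    total = 0.0
    for th, ui in zip(theta, u):
        if th > 0:
            total += ui
        elif th < 0:
            total -= ui
        else:
            total += abs(ui)  # kink at zero: the derivative is |u_j|
    return total

def difference_quotient(theta, u, h):
    """[T(theta + h u) - T(theta)] / h for a step size h > 0."""
    shifted = [th + h * ui for th, ui in zip(theta, u)]
    return (l1_norm(shifted) - l1_norm(theta)) / h
```

Because the $\ell_1$ norm is piecewise linear, the quotient equals the closed form (up to rounding) as soon as $h$ is small enough that no non-zero coordinate of $\theta$ changes sign.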

Lemma 11. Let $\hat\theta_n(\lambda_n)$ be the penalized estimate, $q_n$ be a sequence such that $q_n \to \infty$ as $n \to \infty$ and $\lambda_n$ be
a sequence of (potentially random) non-negative real numbers. Assume $T$ is a penalty function satisfying
condition P4 in PA. If $q_n \cdot \hat\theta_n(0) = O_p(1)$, then $q_n \cdot \hat\theta_n(\lambda_n) = O_p(1)$.

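Proof. First, we argue by contradiction to prove that $T(\hat\theta_n(\lambda_n)) \le T(\hat\theta_n(0))$. From the definition of $\hat\theta_n(0)$, we
have:

\[ \frac{1}{n} \sum_{i=1}^{n} L(Z_i, \hat\theta_n(0)) \le \frac{1}{n} \sum_{i=1}^{n} L(Z_i, \hat\theta_n(\lambda_n)). \]

Supposing that $T(\hat\theta_n(\lambda_n)) > T(\hat\theta_n(0))$, we get

\[ \frac{1}{n} \sum_{i=1}^{n} L(Z_i, \hat\theta_n(0)) + \lambda_n \cdot T(\hat\theta_n(0)) < \frac{1}{n} \sum_{i=1}^{n} L(Z_i, \hat\theta_n(\lambda_n)) + \lambda_n \cdot T(\hat\theta_n(\lambda_n)), \]

a contradiction with the definition of $\hat\theta_n(\lambda_n)$ as the minimizer of $f(t) = \left[ \frac{1}{n} \sum_{i=1}^{n} L(Z_i, t) \right] + \lambda_n \cdot T(t)$.
Now, from $q_n \cdot \hat\theta_n(0) = O_p(1)$, we have that, for any $\delta > 0$, there exists a compact $K_0 \subset \Theta$ such that
$P(q_n \cdot \hat\theta_n(0) \in K_0) > 1 - \delta$. Let $U = \max_{t \in K_0} q_n \cdot T(t)$ and define $\tilde K_0 = \{t \in \Theta : q_n \cdot T(t) \le U\}$.
Since $T(\hat\theta_n(\lambda_n)) \le T(\hat\theta_n(0))$, it follows that $P(\hat\theta_n(\lambda_n) \in \tilde K_0) \ge P(\hat\theta_n(0) \in K_0) > 1 - \delta$.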



□

A.2 Proof of results in Section 3

Proof of Theorem |5] For the £i -penalty, the difference between A„ + — ||/3||i^— -^(u^ — || 1 1 , 
0, uniformly as n ^ oo. Using Theorem [T] and Lemma |2l 



u := argmm 



'n 



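with $W \sim N(0, J(\theta))$. We assume, without loss of generality, that $\beta_A > 0$ with the inequality holding
element-wise. In that case:

\[ \mathrm{sign}(\hat\beta_j(\lambda_n)) = \mathrm{sign}(\beta_j) \Leftrightarrow \hat u_j > -\sqrt{n}\,\beta_j, \text{ for } j \in A, \]
\[ \mathrm{sign}(\hat\beta_j(\lambda_n)) = \mathrm{sign}(\beta_j) \Leftrightarrow \hat u_j = 0, \text{ for } j \in A^c. \]

For the remainder of this proof, we denote $H(\theta)$ by $H$. In terms of the $\alpha$, $A$, $A^c$ partition, the
Karush-Kuhn-Tucker (KKT) conditions for the optimization defining $\hat u$ above are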

iiot.A ■ UA + Ha,A'' ■ UAc + Ha,a ■ Ua + = 0, 

Ha,A • UA + Ha,A'= ■ UA'^ + HA,a • u^ + - = 0, 
Hj,A ■ UA + Hj^A'' ■ UA'' + -f^i.a ' + - ^ • sign(nj) = 0, for j G A"" s.t. uj / 
\Hj^A • UA + Hj^A'^ • UAc + i^j> • + Wj| < for j G s.t. u^- = 0. 

To select the zero terms in $\beta$ correctly, we must have $\hat u_{A^c} = 0$. In that case,



Ua 






HA,a 


-1 




UA 




HA,a 


Ha,A _ 







Using Schur's inversion formula for partitioned matrices, we get: 

UA = [Ha,A - HA,aH-XHa,A] • • 1, - - [/^A^^a.i] • W, 

L V 

Define a zero mean Gaussian random vector W = W^, W_4c 

-n . [Wa + HA,a. ■ H-}^ ■ W„] , and 

r H^r H^^r n 

Ti7 TIT- Ha-^ a Ha.a Ha.A Wq 

W^c := W^c — ) With 

\[ \Omega := \left[ H_{A,A} - H_{A,\alpha} H_{\alpha,\alpha}^{-1} H_{\alpha,A} \right]^{-1}. \]
The M-estimated parameter fails to have the correct signs if:



w 
w 



> y^- Pj- J2keA ^Jfe' for 3 e A OR, 

> ^(l-HA^^AlHA^A-HA^cH-^^H^^Ar^lgj, for some je^^ OR, 

j < ^(^-1- H^c^A [Ha,a - HA,aH-^aHa,A]~^ u) ' some j e A". 



For the remainder of the proof, let $\Phi$ denote the standard normal cumulative distribution function, and define

\[ \sigma_j^2 := \mathrm{var}(\tilde W_j). \]



Proof of a): 

We prove that if ^ [^A,A ~ i^a,^] ^ Igllcx) < 1 and there exists c > and A G M sucg that 

0, then the probability of each of these three events decreases to zero exponentially 



n 

fast. 



A and 



For the first event, use the union bound and the inequality 1 — $(r) < exp 



get 



for large enough r, to 



(ft-^)) s I:p(w,>vs(a-^ 



< 



jeA 
jeA 



13. 



<;j n ■ 



< 9' 



n • max <ri 



exp 



exp 



sjn ■ [ min ( — ) 



n-max 



— ^ mm 

2 jeA 



xfn ■ min ( — ) 



To tackle the second and third events, define = 1 — ||-ff^c_^ \iiA,A — -f^^,a-f^a,a^a,^] ^ Iglloo and 
notice that 



[W, > ^ (l - iiA^,A {Haa - HA,aH-l^H^,A\ 1,) j < P (^VK, > • j , for all j e X, AND 

As a result, using the union bound gives that the probability of the second or third event happening is 
bounded above by ^j^j^c F (^Wj > rj ■ . To prove this probability vanishes exponentially fast, we 
use the same inequality as above:



E 



> 77. 



< E^pi 

jeA" 



> 



'^3 
1 - $ 



rj_ K_ 



< 2{p-q)- 1 - $ 

< 2{p — q) ■ cxp 
~ 2(p - q) ■ exp 



max \/n 



1 / V K 

2 1 niax<j,- Jn 
\3eA- ■' " . 



^ c ■ r, 
- — ■ n • mm — 
2 Q 



Proof of b): 

To prove the converse in part (b), first notice that W^^^ is a positive definite matrix. It follows that 
"^jeA YlkeA ^jfc ^ ^' ^^^^ ^^^^ there must exists j £ A with X^j^g^ Hjfc > 0. Thus, if ^ ^ cxd, the 
first event takes place with probability approaching one (exponentially fast) as long as A is non-empty. On 
the other hand, if ^ ^ 0, then the union of the second and third event occurs with probability approaching 



l + c 



A, for some finite A G M and 



one (exponentially fast). Thus, we only need to consider the case 
CG [0,1). 

As before, let $\eta = 1 - \left\| H_{A^c,A} \left[ H_{A,A} - H_{A,\alpha} H_{\alpha,\alpha}^{-1} H_{\alpha,A} \right]^{-1} 1_q \right\|_\infty$. If $c > 0$ and $\eta < 0$, the probability
of the second or third events converges to one (exponentially fast). If $\eta = 0$, the second or third events have
a positive probability of taking place regardless of A„. Likewise, if c = 0, < 0, the second or third events 
happen with strictly positive probability. □ 
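
The exponential rates in the proof rest on a Gaussian tail inequality combined with the union bound. The exponent of the inequality is lost in this copy of the text, so the sketch below assumes the standard bound $1 - \Phi(r) \le \exp(-r^2/2)$ for $r \ge 0$; the function names are illustrative.

```python
import math

def normal_upper_tail(r):
    """1 - Phi(r) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(r / math.sqrt(2.0))

def exp_tail_bound(r):
    """The bound exp(-r^2 / 2), valid for r >= 0 (assumed form of the inequality in the proof)."""
    return math.exp(-0.5 * r * r)

def union_tail_bound(thresholds, sigmas):
    """Union bound on P(W_j > thresholds[j] for some j), for zero-mean Gaussian W_j
    with standard deviations sigmas[j]; thresholds must be non-negative.
    No assumption on the dependence between the W_j is needed."""
    return sum(exp_tail_bound(t / s) for t, s in zip(thresholds, sigmas))
```

When the thresholds grow like a positive power of $n$, each term of the union bound decays exponentially, which is the mechanism behind "exponentially fast" in parts (a) and (b).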

Proof of Theorem^ Throughout this proof we denote u := -piy € M^, the unit vector in the direction of /3. 
Using the properties of Gaussian distributions and the condition i/^^/x = 0, we get 



E[XX^|a + X^/3] 



MM 



(a+X^/j)-(a+;3^M) 



- 1 



For details, we refer the reader to Appendix B.1. Letting $f_M$ denote the density of the random variable
$M = \alpha + X^T\beta$ and defining



m 



[a 



+ /3^m)' 



• w(m) • fM{m) • dm, 
the Hessian of the risk function becomes:

H{e) 



yujl + S + K 



Partition the vectors v, fi, and the matrix S according to the sparsity pattern in u: 

T 



E 



-5 0^ 



T T 
/J._4c 



and 



The partitioned Hessian becomes 

Hie) = 

Defining A := (Ig + k • vjs.u'^ ■ S^,^), we get 



Ha^.a{0) [Haa{S)V^ = {t^A^t^A + ^a^a) ■ A X [{^JiA^JiA + ^a,a) ■ A] ' 
= (Ai-4'=Ai3i + ^a-a) aa"^ {^J'A^^'A + ^aa) 

= [E(X^c,^)][E(X^,^)]-\ 
The result follows from post-multiplying both sides of this last equation by $\mathrm{sign}(\beta_A)$.

T 



□ 



Proof of Lemma^ Throughout the proof of Lemma [H we define X 



1 X 



T 



L1) We first prove the existence of a minimizer. Given that the risk function is continuous, it is enough to



prove that the closed set S{M) = |t G : E 



< M 



-I(Y = 1) • X^i + log (^1 + exp {y-^t^ 
for large enough $M$ is bounded. We establish boundedness of $S(M)$ by proving that $S(M)$ is
contained in a finite sphere around the origin, which can be established by proving that, for any $M$, there
exists $\gamma$ such that:



\{t:t)\>i 



E 



-I(Y = 1) • X^ t + log ( 1 + exp ( X^t 



> M. 



(A-4) 



To prove the assertion in (A-4), let $t$ be a non-zero vector with $u = \langle t, t \rangle \neq 0$, so $t$ can be written
as $t = u v$, for some $v \neq 0$. For any $u_0 \in \mathbb{R}$, we can write:



E 



• I(Y = 1) • X^?; + log 1 + cxp M • X^u 



-u ■ E 



p(X) • X^w 



-E 



log ( 1 + exp I u ■ X.^v 



■ •E 



p(X) • X'^v + E log (l + cxp (^uo • X^v 



-E 



> 



l+exp(wo-X'^u) 



(u - -Uo) 



l+exp^uo'X^i'^ 



p(X) -X^^ 



where the inequality follows from convexity of the mapping $s \mapsto \log(1 + \exp(s))$ and

cxp(?t()-X^?)) 



c(«o, f ) = E log ( 1 + exp ( lio • X 
u. 



E 



l+exp^UQ-X'^t;^ 

Under the assumptions made, we can use the dominated convergence theorem to get: 
exp ^uo • X-^w^ 



Uo, which does not involve 



lim E 

uo— »oo 



lim E 

tlQ— > — OO 



1 + cxp ^Uq ■ XJ'v^ 
exp ( Uq ■ X-^ti 



1 + exp (^uq ■ XJ'vj 



-p{X.) I -X^v 
p(X) I • ±^v 



E 



-E 



(l-p(X)).X^i; 
p{X) ■ X^v 



, and 



Since the density is everywhere positive, the hyperplane {s G M^'+^ : s^v = 0} has probability zero 
for any f / and thus we have either E (X-vJ > or E (XvJ < 0. 

If E (xvj > 0, pick Uo large enough so E (1 — p(X)) • X^v > to conclude that 



lim E 

u— »oo 



-I(Y = 1) • X^t + log (l + exp (x^i)) 



OO. 



If, on the other hand, E (xvj < 0, pick uq small enough so E p(X) • X 

-I(Y = 1) • X^t + log (l + exp (x^t)) 



< to conclude: 



lim E 

u— »oo 



OO. 



This establishes that for any $M$ there exists $\gamma$ such that the risk function exceeds $M$ and completes
the proof of existence.

The proof of uniqueness follows from strict convexity of the risk function. We prove that below by 
showing that the Hessian matrix of the risk function is everywhere strictly positive definite under the 
assumptions made. 
L2) For the canonical logistic regression loss function, we have:



E[|L(Y,X,i)|] = E I(Y = l)-X^i-log(^l + exp(|x^t)) 



< E 

= E 



X 



X 



T 



t + 



log ( 1 + exp ( X^t 



T 



t + E 



log (l + exp (X^i)) 



where the equality follows from $\exp(\tilde X^T t) > 0$ for all $t$. Because $E[XX^T] < \infty$, there exists $C$
such that $E[|X_j|] < C$ for all $j = 1, \ldots, p$ and the first term of the sum is bounded above. To bound
the second term, write:



\[ E\left[ \log\left(1 + \exp(\tilde X^T t)\right) \right] \le E\left[ \log\left(1 + \exp(|\tilde X^T t|)\right) \right] \le \log(2) + 2 \cdot E\,|\tilde X^T t|, \]

where the first inequality follows from $h(u) := \log(1 + \exp(u))$ being non-decreasing and the second
stems from $h$ having derivatives bounded above by 1. The result now follows from $E|X|$ being
bounded.

L3) The canonical logistic regression loss function is twice differentiable everywhere, with: 

exp (x.^t ' 



VtL{Y,X,t) = 
V|L(Y,X,t) = 
For all Y, X and t G MP+\ we have: 



I(Y = 1)- 



1 -I- exp 



(xn) 



• X, and 



exp X^ t 



(l + exp(xrt))' 



XX^ 



(2Y - 1) 



exp {X.'^t 
1 + exp (X^t^ 

exp (x.^i 



< 2, and 



2 — 



< 1, 



(l + exp(x2^t)) 
so, using the assumptions on the moments of X, we know that 



E[|VtL(Y,X,t)|] < 2max 1, maxE|X,- 

[ i<j<p 

E[|vfL(Y,X,i)|] < E[Q(X)]<oo. 



< oo, and 
Using the Dominated Convergence Theorem we get that:



VtRit) = E[VfL(Y,X,t)] =E 



E(I(Y = 1)|X) 



exp(X^t) \ 
1 + exp(X^t) J 



X 



E 



P(X) 



exp(X^ t) 
1 + exp(X^t) 



X 



and 



V^R{t) = E [V?L(Y,X,t)] =E 



exp I X t 



1 + exp (X^t 



•XX^ 



We now prove that the population risk minimizer $\theta$ for the logistic regression is unique under the
conditions of Assumption Set 3, by proving that the Hessian $\nabla_t^2 R(\theta)$ is a strictly positive definite
matrix.

From the assumption that $E[Q(X)]$ is strictly positive definite and bounded, we get that:

\[ E[Q(X)] = \lim_{S \to \infty} E\left[ Q(X) \cdot I(\|X\| \le S) \right], \]

and thus, there must exist a large enough $S$ such that $E[Q(X) \cdot I(\|X\| \le S)]$ is strictly positive definite.

Let $\varepsilon = \inf_{\|x\| \le S} \frac{\exp(\alpha + x^T\beta)}{[1 + \exp(\alpha + x^T\beta)]^2}$. Because $\{x : \|x\| \le S\}$ is a compact set and
$\frac{\exp(\alpha + x^T\beta)}{[1 + \exp(\alpha + x^T\beta)]^2} > 0$ for all $x$, we get that $\varepsilon > 0$.

In what follows, the binary relationship between matrices $A$ and $B$ indicated by $A \succeq B$ means $A - B$
is positive semi-definite and its strict version $A \succ B$ means $A - B$ is strictly positive definite. Now,



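\begin{align*}
\nabla_t^2 R(\theta)
 &= E\left[ \frac{\exp(\tilde X^T\theta)}{(1 + \exp(\tilde X^T\theta))^2} \cdot Q(X) \right] \\
 &= E\left[ \frac{\exp(\tilde X^T\theta)}{(1 + \exp(\tilde X^T\theta))^2} \cdot Q(X) \cdot I(\|X\| \le S) \right]
  + E\left[ \frac{\exp(\tilde X^T\theta)}{(1 + \exp(\tilde X^T\theta))^2} \cdot Q(X) \cdot I(\|X\| > S) \right] \\
 &\succeq \varepsilon \cdot E\left[ Q(X) \cdot I(\|X\| \le S) \right]
  + E\left[ \frac{\exp(\tilde X^T\theta)}{(1 + \exp(\tilde X^T\theta))^2} \cdot Q(X) \cdot I(\|X\| > S) \right] \\
 &\succ 0,
\end{align*}

where the last generalized inequality follows from
$E\left[ Q(X) \cdot \frac{\exp(\tilde X^T\theta)}{(1 + \exp(\tilde X^T\theta))^2} \cdot I(\|X\| > S) \right] \succeq 0$,
$\varepsilon > 0$, and $E[Q(X) \cdot I(\|X\| \le S)] \succ 0$.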



L4) The loss function corresponds to the neg-loglikelihood function of a canonical exponential family and 
is thus convex. As the risk is an expected value of convex functions, it is also convex. 

□ 
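
The derivative formulas used in L3) can be checked against finite differences. The sketch below assumes the loss is written as in L1), namely $-I(Y = 1) \cdot \tilde X^T t + \log(1 + \exp(\tilde X^T t))$ with $\tilde X = [1 \; X^T]^T$; under that sign convention the gradient is $\left( \frac{\exp(\tilde X^T t)}{1 + \exp(\tilde X^T t)} - I(Y = 1) \right) \tilde X$. All function names are illustrative.

```python
import math

def softplus(s):
    """Numerically stable log(1 + exp(s))."""
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))

def sigmoid(s):
    """Numerically stable exp(s) / (1 + exp(s))."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def linear_predictor(x, t):
    """Inner product of the augmented predictor [1, x] with t."""
    xt = [1.0] + list(x)
    return xt, sum(a * b for a, b in zip(xt, t))

def logistic_loss(y, x, t):
    """-I(y = 1) * s + log(1 + exp(s)), with s the linear predictor (assumed sign convention)."""
    _, s = linear_predictor(x, t)
    return -(1.0 if y == 1 else 0.0) * s + softplus(s)

def logistic_grad(y, x, t):
    """Gradient in t: (sigmoid(s) - I(y = 1)) * [1, x]."""
    xt, s = linear_predictor(x, t)
    w = sigmoid(s) - (1.0 if y == 1 else 0.0)
    return [w * a for a in xt]

def logistic_hess(x, t):
    """Hessian in t: exp(s) / (1 + exp(s))^2 * [1, x][1, x]^T; it does not depend on y."""
    xt, s = linear_predictor(x, t)
    p = sigmoid(s)
    w = p * (1.0 - p)  # equals exp(s) / (1 + exp(s))^2 and lies in (0, 1/4]
    return [[w * a * b for b in xt] for a in xt]
```

The scalar weight in the Hessian never exceeds 1/4, which is consistent with the bound by 1 used in the integrability argument.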



Proof of Lemma 9. Throughout the proof of Lemma 9 we define $\tilde X = [1 \; X^T]^T$.

L1) We first prove that a minimizer exists. Given that the risk function is continuous, it is enough to prove
that for large enough $M$ the closed set $S(M) = \left\{ t \in \mathbb{R}^{p+1} : E\,[1 - Y\tilde X^T t]_+ \le M \right\}$ is bounded.

We establish boundedness of $S(M)$ by proving that $S(M)$ is contained in a finite box around the
origin. Letting $e_j$ be a unit vector with a 1 in its $j$-th entry and zeroes in all other components, it is
sufficient to prove that, for any $M$ and each $j = 1, \ldots, p + 1$, there exists $\gamma_{j,M}$ such that:

\[ |\langle t, e_j \rangle| > \gamma_{j,M} \;\Rightarrow\; E\,[1 - Y\tilde X^T t]_+ > M. \qquad \text{(A-5)} \]

To prove the assertion in (A-5), let $t$ be a non-zero vector with $u = \langle t, e_j \rangle \neq 0$, so $t$ can be written as
$t = u e_j + v$, for some $v$ with $\langle v, e_j \rangle = 0$. The risk function at $t$ becomes:



E 



1 - YX^ t 



= E 
> E 



1 — u - X"^ei — X"^t; 



u ■ X"^ej + X"^i; 



(Y = r 



+ E 



l + u- X^e,- - X^v 



I(Y = -1) 



> inf E 



u ■'K. 60 + X V 



\u\ ■ inf E 



1, 



where $O(e_j)$ is the set of all vectors orthogonal to $e_j$. Because $e_j$ has unit norm, $\|e_j + v\| \ge 1$ for all
$v \in O(e_j)$ and it follows that $\{e_j + v : v \in O(e_j)\} \subset \{v : \|v\| \ge 1\}$, yielding:



E 



1 - YX.^t 



> u 



inf E 

v:\\v\\>l 



inf E 

d:||i'|| = 1 



1, 



where the equality follows from noticing that 



X^v 



is increasing m\v\. If we can find c > 0, such 



that inf„.||^|j=i E 



X^v 



$> c$, it is possible to find the $\gamma_{j,M}$ we want. To find such a positive lower
bound, define the compact set $K = \{x \in \mathbb{R}^{p+1} : \|x\|_2 \le C\}$ for some constant $C$. Since it is
assumed that $f_X(x) > 0$ is continuous for all $x \in \mathbb{R}^p$, we get that $f^* = \min_{x \in K} f_X(x) > 0$. Now, letting
$\mu(A)$ denote the Lebesgue measure of a set $A \subset \mathbb{R}^{p+1}$, we have:



inf E 

v:||d|| = 1 



X^v 



> r] ■ inf ] 

i':|jD||>l 

> inf : 

v:\\v\\>l 



X^v 



X^v > r], and X G K 



- f* ' ({x : X v> r/}) . 

i;:||i'||>l 

= /* • ^ ({x : x^ei > r?}) =: > 0, 
where the last equality follows from noticing that, because of symmetry:

\[ \mu(\{x : x^T v > \eta\}) = \mu(\{x : x^T e_1 > \eta\}), \text{ for all } v \in \mathbb{R}^{p+1} : \|v\|_2 = 1. \]



Using the strictly positive lower bound afforded by c^, we get: 

I I I/. M M+1 ^ . ^ 

1^1 = K*>ej)| > 7j,M := —Fi , for some J = l,...,p 

J 



E 



1 - YX^ t 



+ 



> M. 



Uniqueness of the minimizer follows from strict convexity of the risk function under the assumptions 
made. Strict convexity of the risk function in its turn is proved below, by showing the Hessian matrix 
for the risk is everywhere strictly positive definite. 



L2) For all t G M*'+^: 



E[|L(Z,t)|] < E max I 



max si — Xt 



1 + Xt 



}]<- 



1 + E 



Xt 



< 1 + Wi* -E 



X^X 



t, 



which is bounded given the assumptions on the distribution of X. 



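L3) The hinge loss is not differentiable on the set $\{X : \tilde X^T t = 1 \text{ or } \tilde X^T t = -1\}$, which under the assumed
conditions has zero probability. At all other points, the hinge loss function has derivative with respect to $t$

\[ \nabla_t L(Z, t) = -\tilde X \cdot \left[ I(Y = 1) \cdot I(\tilde X^T t - 1 < 0) - I(Y = -1) \cdot I(\tilde X^T t + 1 > 0) \right]. \]

To obtain the Hessian, write the SVM risk as $R(t) = R_1(t) + R_2(t)$, with

\[ R_1(t) := E\left[ p(X) \cdot (1 - \tilde X^T t) \cdot I(1 - \tilde X^T t > 0) \right], \text{ and} \]
\[ R_2(t) := E\left[ (1 - p(X)) \cdot (1 + \tilde X^T t) \cdot I(1 + \tilde X^T t > 0) \right]. \]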



first show that ri(t) := Ri{t + At) - Ri{t) = Vt-Ri(t) • At = o(||At||). To do that, let dfe - ^ 



LetVt-Ri(t) :=E 

first show 
and write 

Rt{t + Mk)-Ri{t) 



p{t) • I (^1 - Xt > o) • X and VlRiit) := E p{X.) ■ (5(1 - Xt) ■ X^X 



.We 



E 



p(X) (l - X<) I (l - Xt > XAt) - I (l - Xt > o) 



XAt 



X 



dfe 



-E 



p(X) • I (l - Xt > XAt) • X 



Using the Dominated Convergence Theorem to take the limit as $\Delta t \to 0$, and collecting the limit of
the multiplier of $d_k$ yields



VtRiU) = lim I 

\At\^0 



p{X) (l - Xtj I (l - > XAt^ - I (l - > o) 



lim E 

I At 1^0 



p(X) • I (l - Xt > XAtj ■ X 



X 



= E 



-E 



p{±) (l - Xt) ^(1 - X^t) ■ X 
p(X) • I - xt > o) • X 



-E 



p(X) • I (l - Xi > o) • X 



To obtain the second differential for Ri{t), write the residuals from the approximation from the first 
differential: 



ri(At) 



dfcE 



p(X) • X"^ • I (l - Xi > XAtj - I - Xi > o) • (l - Xt) • X 



-dfc-E 



p(X) . X^ 



At'^XX^At 
(l-X.t> XAi) - I (l - Xf > o) 



dfc 



•X 



XAt 



■ dfc 



The second derivative is obtained using the Dominated Convergence Theorem to compute the limits
of the terms in the sum. For the second term, the limit follows directly from pointwise convergence
to a Dirac delta function:



lim E 
I At 1^0 



p(X)-X-[l(l-X^t>X^At)-l(l-X^t>0)]-X^ 
X^At 



-E 



p(X) • 6{1 - X^t) • XX^ 



-1 



To obtain the limit for the other term, let $W = Rx = [w_1 \; W_2]$ be a linear rotation
of $X$ such that $w_1 = x^T t$, and let $F_X$, $F_{W_2}$, and $F_{w_1|W_2}$ denote the distributions of $X$, $W_2$ and the
conditional distribution of $w_1$ given $W_2$ respectively. Then write



p(X)-X'^-[l(l-Xt>XAtfc)-l(l-Xt>0)]-(l-Xt)-X 

A/,f XX^Affc 
(l-x;,)'[](l-x/>xA/|,.)-L(l-x/>Q)] 



At'J x^x^Affc 



p(x) • XX dFx (x) 



j- (l-Wi)-p(l-wi>x(w)Atfc)-l[(l-wi>0)] 
AtJi(w)i-''(w)At|i 



s(wi,W2)ciF^i|w2(wi |w2)dFw2(w2), 



where we used the notation $s(w_1, W_2) := p(x(w_1, W_2)) \, x(w_1, W_2) \, x(w_1, W_2)^T$.
To obtain the limit, write the inner integral as:



(l-Wi)-[l[(l-Wi>x(wi,W2)Atfc)-I(l-Wi>0)] 

AtJ'i(wi,W2)i^(w)Atfc 
^;-[l[(D>x(wi,W2)Atfc)-l[(^;>0)] 
AtJ'x(l-t>,W2)x^(w)Atfc 



,s(wi,W2) dF^i|w2(wl |W2) 

s(l-U,W2) dF^^i^.^{l - V \W2) 



■/ [i • S(W1,W2) • 6{1 - Wi)] dF^,|w2(wl |W2) + o{\[At\\). 
Plugging that back into the expression for the expected value, we get:



-/[/ 
-i.E 



p(X)-X^-[l(l-Xt>XAtfc)-l(l-Xt>o)]-(l-Xt)-X 
AtrxX^AU 



i • s(wi, W2) • S{1 - wi)] dF^^i^^{wi |W2)] dFw2(w2) 
p(X) • XX^ • (5(1 - X^t)l . 



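Summing the two terms (and taking into account the factor $\tfrac{1}{2}$ in the Taylor expansion) yields

\[ \nabla_t^2 R_1(t) = E\left[ p(X) \cdot \delta(1 - \tilde X^T t) \cdot \tilde X \tilde X^T \right]. \]

For $R_2$, analogous steps yield

\[ \nabla_t R_2(t) = E\left[ (1 - p(X)) \cdot I(1 + \tilde X^T t > 0) \cdot \tilde X \right] \text{ and} \]
\[ \nabla_t^2 R_2(t) = E\left[ (1 - p(X)) \cdot \delta(1 + \tilde X^T t) \cdot \tilde X \tilde X^T \right]. \]

The result follows from summing the differentials for $R_1(t)$ and $R_2(t)$:

\[ \nabla_t R(t) = E\left[ \left( (1 - p(X)) \cdot I(1 + \tilde X^T t > 0) - p(X) \cdot I(1 - \tilde X^T t > 0) \right) \cdot \tilde X \right], \text{ and} \]
\[ \nabla_t^2 R(t) = E\left[ \left( (1 - p(X)) \cdot \delta(1 + \tilde X^T t) + p(X) \cdot \delta(1 - \tilde X^T t) \right) \cdot \tilde X \tilde X^T \right]. \]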



Finally, we prove that the minimizer of the SVM risk is unique by establishing that $\nabla_t^2 R(\theta)$ is strictly
positive definite. To do that, first write:

\begin{align*}
\nabla_t^2 R(\theta) &= E\left[ Q(X) \cdot I(Y = 1) \,|\, \alpha + X^T\beta = 1 \right] \cdot f(1) \\
 &\quad + E\left[ Q(X) \cdot I(Y = -1) \,|\, \alpha + X^T\beta = -1 \right] \cdot f(-1) \\
 &= E\left[ Q(X) \,|\, Y = 1, \alpha + X^T\beta = 1 \right] \cdot P\left( Y = 1 \,|\, \alpha + X^T\beta = 1 \right) \cdot f(1) \\
 &\quad + E\left[ Q(X) \,|\, Y = -1, \alpha + X^T\beta = -1 \right] \cdot P\left( Y = -1 \,|\, \alpha + X^T\beta = -1 \right) \cdot f(-1),
\end{align*}

where $f$ denotes the density of the random variable $\alpha + X^T\beta$. Given assumption C2, $f(-1) > 0$
and $f(1) > 0$. In addition, assumption C3 gives that $P(Y = -1 \,|\, \alpha + X^T\beta = -1) > 0$ and
$P(Y = 1 \,|\, \alpha + X^T\beta = 1) > 0$. It is thus enough to prove that either $E[Q(X) \,|\, Y = 1, \alpha + X^T\beta = 1]$
or $E[Q(X) \,|\, Y = -1, \alpha + X^T\beta = -1]$ is strictly positive definite (or both).

Define $\nu = \frac{\beta}{\|\beta\|}$, the unit vector in the direction of $\beta$. The condition $\alpha + X^T\beta = \kappa$ is equivalent to
$X^T\nu = \frac{\kappa - \alpha}{\|\beta\|}$, so for any scalar $\kappa$:

E[Q(X)|Y,a + X'^/3 



E 



Q(X) 



K — a 



Then notice that: 



E 



Q(X) 



Y,X% 



K — a 



^ ( ^^^r-TT^ ] (v/3vj) + var ( X 



Y,X% 



K — a 



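where $A \succeq B$ denotes that $A - B$ is positive semi-definite. Because $\mathrm{var}[X \,|\, Y]$ is assumed to be
non-singular, $\mathrm{var}\left[ X \,|\, Y, X^T\nu = \frac{\kappa - \alpha}{\|\beta\|} \right]$ has rank $p - 1$ and $\mathrm{var}\left[ X \,|\, Y, X^T\nu = \frac{\kappa - \alpha}{\|\beta\|} \right] \nu = 0$.

Thus, as long as $\kappa \neq \alpha$, $E\left[ Q(X) \,|\, Y, X^T\nu = \frac{\kappa - \alpha}{\|\beta\|} \right]$ is strictly positive definite.

If $\alpha \notin \{-1, 1\}$, both terms in the sum defining $\nabla_t^2 R(\theta)$ are strictly positive definite. If $\alpha \in \{-1, 1\}$,
one of the terms in the sum defining $\nabla_t^2 R(\theta)$ is singular, but the other is necessarily strictly positive
definite. Thus, it follows that $\nabla_t^2 R(\theta)$ is strictly positive definite as stated.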

L4) We can write $|1 - \tilde X^T t|_+$ as the maximum between the constant function 0 and the function $1 - \tilde X^T t$,
which is linear, and thus convex, in $t$. Since it is the maximum between two convex functions of $t$,
$|1 - \tilde X^T t|_+$ is convex in $t$. A similar argument yields that $|1 + \tilde X^T t|_+$ is convex in $t$.

The loss function $L(Z, t)$ is written as the sum (with positive weights) of convex functions, which
proves that AL.IV holds for the SVM loss function.



□ 
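
The hinge-loss gradient in L3) and the convexity claim in L4) can be illustrated numerically. The sketch below assumes labels $Y \in \{-1, 1\}$, the loss $\max(0, 1 - Y \tilde X^T t)$ with $\tilde X = [1 \; X^T]^T$, and the gradient $-\tilde X \cdot [I(Y = 1) I(\tilde X^T t - 1 < 0) - I(Y = -1) I(\tilde X^T t + 1 > 0)]$, which is valid away from the kinks. Function names are illustrative.

```python
def hinge_loss(y, x, t):
    """max(0, 1 - y * s), with s the inner product of [1, x] and t, and y in {-1, +1}."""
    xt = [1.0] + list(x)
    s = sum(a * b for a, b in zip(xt, t))
    return max(0.0, 1.0 - y * s)

def hinge_grad(y, x, t):
    """Gradient of the hinge loss in t, valid wherever s is not exactly +1 or -1."""
    xt = [1.0] + list(x)
    s = sum(a * b for a, b in zip(xt, t))
    active_pos = 1.0 if (y == 1 and s - 1.0 < 0) else 0.0   # I(Y = 1) I(s - 1 < 0)
    active_neg = 1.0 if (y == -1 and s + 1.0 > 0) else 0.0  # I(Y = -1) I(s + 1 > 0)
    weight = active_pos - active_neg
    return [-weight * a for a in xt]
```

For a point with positive margin slack the gradient is $\mp\tilde X$ depending on the label, and it vanishes once the point is classified with margin larger than one.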



B Calculations for SVM and logistic risk Hessians in selected cases 

In this section, we first obtain expressions for the second moment of the predictors given the value of the
margin variable $\alpha + X^T\beta$ in the case of predictors $X$ having a Gaussian and a mixture of Gaussian
distributions. Given the characterization of the SVM and logistic Hessians as a "weighted average" of such
conditional second moments in Equations (22) and (23), the expressions for such conditional moments are
useful in analytically comparing $\ell_1$-penalized SVM and logistic classifiers with respect to their model
selection properties. We then give explicit analytical expressions for the Hessians of the SVM and logistic
regression risk functions in the case of Gaussian and mixed Gaussian predictors.

For the duration of this section, $Z$ denotes a rotated version of $X$ whose first component is the projection
of $X$ along the direction normal to the optimal separating hyperplane $\mathcal{H}(\theta)$.

B.1 Conditional moments of Gaussian predictors given the value of one of its projections

To obtain the conditional second moments used in the expressions for Hessians of the SVM and logistic
regression risk functions, we first construct an orthogonal matrix $S$ according to

\[ S := [\nu \;\; U], \]

with $U$ a $p \times (p - 1)$ matrix constructed using a Gram-Schmidt orthogonalization (as long as $\beta \neq 0$). By
construction, $U^T\nu = 0$ and $U^T U = I_{p-1}$. The random vector $Z = S^T X \in \mathbb{R}^p$ is partitioned into a random
scalar $Z_1 = \nu^T X$ in the direction of $\nu$ and a $(p - 1)$-dimensional random vector $Z_2 = U^T X$ orthogonal to $\nu$,

\[ Z = [Z_1 \;\; Z_2^T]^T. \]

Conditioning on the margin variable $M = \alpha + X^T\beta$, defined in (18), is equivalent to conditioning on
$Z_1 = \nu^T X$ since:

\[ \alpha + X^T\beta = M \;\Leftrightarrow\; \nu^T X = \frac{M - \alpha}{\|\beta\|} \;\Leftrightarrow\; Z_1 = \frac{M - \alpha}{\|\beta\|}. \]

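Since $S$ is orthogonal, $X = SZ$ and

\[ E[XX^T \,|\, \nu^T X] = S \, E[ZZ^T \,|\, Z_1] \, S^T = [S \, E[Z \,|\, Z_1]] \cdot [S \, E[Z \,|\, Z_1]]^T + S \cdot \mathrm{var}[Z \,|\, Z_1] \cdot S^T. \]

For $X \sim N(\mu, \Sigma)$, $Z$ is also Gaussian with expected value $S^T\mu$ and variance $S^T \Sigma S$. Partitioning the
expressions for the expected value and variance of $Z$ we get

\[ E\,Z = \begin{bmatrix} \nu^T\mu \\ U^T\mu \end{bmatrix} \quad \text{and} \quad \mathrm{var}[Z] = \begin{bmatrix} \nu^T\Sigma\nu & \nu^T\Sigma U \\ U^T\Sigma\nu & U^T\Sigma U \end{bmatrix}. \]

Based on these expressions and standard results on multivariate Gaussian distributions, we get:

\begin{align*}
E[Z_1 \,|\, Z_1] &= Z_1, & E[Z_2 \,|\, Z_1] &= U^T\mu + U^T\Sigma\nu \, (\nu^T\Sigma\nu)^{-1} (Z_1 - \nu^T\mu), \\
\mathrm{var}[Z_1 \,|\, Z_1] &= 0, & \mathrm{var}[Z_2 \,|\, Z_1] &= U^T \left[ \Sigma - \Sigma\nu \, (\nu^T\Sigma\nu)^{-1} \nu^T\Sigma \right] U,
\end{align*}
and $\mathrm{cov}[Z_1, Z_2 \,|\, Z_1] = 0$.

It thus follows that:

\[ E[X \,|\, \nu^T X] = S \cdot E[Z \,|\, Z_1 = \nu^T X] = \nu \cdot E[Z_1 \,|\, Z_1] + U \cdot E[Z_2 \,|\, Z_1] = \nu\nu^T X + UU^T \left[ \mu + \Sigma\nu \, (\nu^T\Sigma\nu)^{-1} (\nu^T X - \nu^T\mu) \right], \text{ and} \]
\[ \mathrm{var}[X \,|\, \nu^T X] = S \cdot \mathrm{var}[Z \,|\, Z_1 = \nu^T X] \cdot S^T = UU^T \left[ \Sigma - \Sigma\nu \, (\nu^T\Sigma\nu)^{-1} \nu^T\Sigma \right] UU^T. \]

By noticing that $UU^T = U(U^T U)^{-1} U^T$ is a projection matrix on the orthogonal complement of the space
spanned by $\nu$, $UU^T$ can be rewritten as $UU^T = I_p - \nu (\nu^T\nu)^{-1} \nu^T = I_p - \nu\nu^T$. Using this expression for
$UU^T$ and some algebra,

\begin{align*}
E[X \,|\, \nu^T X] &= \mu + \Sigma\nu \, (\nu^T\Sigma\nu)^{-1} (\nu^T X - \nu^T\mu), \text{ and} \\
\mathrm{var}[X \,|\, \nu^T X] &= \Sigma - \Sigma\nu \, (\nu^T\Sigma\nu)^{-1} \nu^T\Sigma. \qquad \text{(A-6)}
\end{align*}

From (A-6), the second moment of $X$ given $\nu^T X$ becomes

\[ E[XX^T \,|\, \nu^T X] = E[X \,|\, \nu^T X] \, E[X \,|\, \nu^T X]^T + \mathrm{var}[X \,|\, \nu^T X]. \qquad \text{(A-7)} \]

Given the linear predictor variable $M$ as defined in (18), the conditional first and second moments of $X$ are

\begin{align*}
E[X \,|\, M] &= \mu + \frac{M - \alpha - \beta^T\mu}{\beta^T\Sigma\beta} \cdot \Sigma\beta, \text{ and} \\
E[XX^T \,|\, M] &= E[X \,|\, M] \, E[X \,|\, M]^T + \Sigma - \frac{\Sigma\beta\beta^T\Sigma}{\beta^T\Sigma\beta}.
\end{align*}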

B.2 Hessians for SVM and Logistic regression risk functions 

With the expression for the conditional second moments of a multivariate Gaussian variable given the value
of one of its projections along the direction $\nu = \frac{\beta}{\|\beta\|}$, equations (23) and (22) give expressions for the
Hessians of the SVM and logistic regression risk functions.



B.2.1 Hessians for Gaussian predictors 

To simplify the expressions, we partition the Hessian according to the intercept and the predictors X as 



with 



$[H_\theta(t)]_{\alpha,\beta} = [H_\theta(t)]_{\beta,\alpha}^T$. Throughout this section $f$ denotes the density of the linear predictor variable $M$.
Hessian for the Logistic regression classifier: Using the expressions derived in Section 4.1.1 we get



(A-8) 



Kl + 



■ K2 
where $\kappa_0$, $\kappa_1$ and $\kappa_2$ are scalars given by





= I 


Ki 


= I 


H2 


= I 



exp(r?i) 
(l+oxp(m)) 

m . 



f{m) ■ dm 

exp(m) 
(l+cxp{m)) 



f{m) • dm , and 



(A-9) 



1 



exp(m) 



f{m) ■ dm . 



(l+exp(m)) 

Hessian for the SVM classifier: Using the expressions derived in Section 4.1.2 we get



Kq ■ fi + Ki ■ lyj^ 



(A- 10) 



where $\kappa_0$, $\kappa_1$ and $\kappa_2$ are scalars given by



Ki + 



■ K2 



Ko = /(I) •P(Y = 1|M = 1) + /(-l) •P(Y = -1 |M = -1) 



/(i)-p(Y = i|M = i; 



+ 



• /(-I) 



+ 



m 



/(I) 



• /(-I) 



-1 |M = -1) , and 
1|M = 1) 

(Y = -1 |M 



(A-11) 



-1). 



B.2.2 Hessians for mixed Gaussian predictors



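When $X$ is distributed according to a mixture of $K$ multivariate Gaussians, the conditional moments of
$X$ involved in the expression for the risk Hessian can be written as a weighted sum of the corresponding
conditional moments for each of the individual Gaussian components as detailed next. Letting $\pi_k$ denote
the proportion of the mixture sampled from a multivariate Gaussian with mean $\mu_k$ and covariance matrix
$\Sigma_k$, for $k = 1, \ldots, K$, the density function of $X$ is:

\[ f(x) = \sum_{k=1}^{K} \pi_k \cdot \frac{1}{\sqrt{|2\pi\Sigma_k|}} \exp\left[ -\frac{1}{2} \cdot (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right]. \]

The conditional second moment $E[XX^T \,|\, M, \mu_k, \Sigma_k]$ given the margin variable $M$ and that $X$ was sampled
from the component with mean $\mu_k$ and covariance $\Sigma_k$ follows from (A-7) above. The first and second
moment conditional solely on the margin variable $M$ can then be computed as:

\begin{align*}
E[X \,|\, M] &= \sum_{k=1}^{K} E[X \,|\, M, \mu_k, \Sigma_k] \cdot P(\mu_k, \Sigma_k \,|\, M), \text{ and} \\
E[XX^T \,|\, M] &= \sum_{k=1}^{K} E[XX^T \,|\, M, \mu_k, \Sigma_k] \cdot P(\mu_k, \Sigma_k \,|\, M),
\end{align*}

where $P(\mu_k, \Sigma_k \,|\, M)$ denotes the probability of a point having been sampled from the Gaussian component
with center $\mu_k$ and variance $\Sigma_k$ given the margin variable $M$. The distribution of $M = \alpha + X^T\beta$ is itself a
mixture of Gaussians whose density $f$ is

\[ f(m) = \sum_{k=1}^{K} \pi_k \cdot \frac{1}{\sqrt{2\pi \, \beta^T\Sigma_k\beta}} \exp\left[ -\frac{1}{2} \frac{(m - \alpha - \beta^T\mu_k)^2}{\beta^T\Sigma_k\beta} \right]. \]

An expression for $P(\mu_k, \Sigma_k \,|\, M)$ then follows from using Bayes's theorem:

\[ P(\mu_j, \Sigma_j \,|\, M) = \frac{\pi_j \cdot (\beta^T\Sigma_j\beta)^{-\frac{1}{2}} \cdot \exp\left[ -\frac{1}{2} \frac{(M - \alpha - \beta^T\mu_j)^2}{\beta^T\Sigma_j\beta} \right]}{\sum_{k=1}^{K} \pi_k \cdot (\beta^T\Sigma_k\beta)^{-\frac{1}{2}} \cdot \exp\left[ -\frac{1}{2} \frac{(M - \alpha - \beta^T\mu_k)^2}{\beta^T\Sigma_k\beta} \right]}. \]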
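
The Bayes step for the component probabilities can be sketched as follows. The code evaluates $P(\mu_k, \Sigma_k \,|\, M) \propto \pi_k (\beta^T\Sigma_k\beta)^{-1/2} \exp\left[ -\tfrac{1}{2} (M - \alpha - \beta^T\mu_k)^2 / (\beta^T\Sigma_k\beta) \right]$ on the log scale to avoid underflow; the function and argument names are illustrative and not part of the paper.

```python
import math

def component_posteriors(M, alpha, beta, pis, mus, Sigmas):
    """Posterior probability of each Gaussian mixture component given the margin M.

    pis: mixture proportions; mus: list of mean vectors;
    Sigmas: list of covariance matrices stored as lists of rows.
    """
    log_weights = []
    for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas):
        Sb = [sum(a * b for a, b in zip(row, beta)) for row in Sigma_k]
        s2 = sum(b * sb for b, sb in zip(beta, Sb))        # beta^T Sigma_k beta
        resid = M - alpha - sum(b * m for b, m in zip(beta, mu_k))
        # The common factor (2 pi)^(-1/2) cancels in the ratio and is omitted.
        log_weights.append(math.log(pi_k) - 0.5 * math.log(s2) - 0.5 * resid * resid / s2)
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]   # log-sum-exp stabilization
    total = sum(weights)
    return [w / total for w in weights]
```

These posteriors are the weights that multiply the per-component conditional moments when forming $E[X \,|\, M]$ and $E[XX^T \,|\, M]$ for the mixture.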



Using (A-8) and (A-10), we have that the Hessians for the SVM and logistic regression risks are given by

k=l ^ 



K 

fc=l 

K 



l^kA ■ 777 



and(A-12) 



[Hem 



K 

E 

k=l 



Kk,0 ■ {fJ'kfJ'k + ^fc) + '^k,l 



+ Kk,2 



where the scalars $\kappa_{k,0}$, $\kappa_{k,1}$ and $\kappa_{k,2}$ for $k = 1, \ldots, K$ are computed according to the risk function. For each
Gaussian component, the $\kappa_{k,0}$, $\kappa_{k,1}$ and $\kappa_{k,2}$ correspond to the $\kappa_0$, $\kappa_1$ and $\kappa_2$ scalars in (A-9) and (A-11)
multiplied by the conditional probability of that component given the margin variable as indicated next.
For the logistic risk and mixed Gaussian predictors, the $\kappa$ scalars are:



Kfe,o = / P(Atfc,5]fc |M = m) • 
Kk,i = /P(/ifc,Sfc|M 
Kfc,2 = /P(/ifc,Sfc|M 



exp(m) 



m , 



m , 



(l+exp(m,)) 
T 

k 

mr 



fk{m) ■ dm 



exp(m) 



(l+exp(r 



m 



0)^ 



• /fc (m) • dm , and 



(A- 13) 



exp(m) 



(l+exp(m))^ 



fk{m) ■ dm 
For the SVM risk and mixed Gaussian predictors, the $\kappa$ scalars are:




(A- 14) 